Provenance#

Every tree sequence has provenance information associated with it. The purpose of this information is to improve reproducibility: given the provenance associated with a given tree sequence, it should be possible to reproduce it. Provenance is split into three sections: the primary software used to produce a tree sequence; the parameters provided to this software; and the computational environment where the software was run.

This documentation serves two distinct purposes:

For developers using tskit in their own applications, it provides normative documentation for how provenance information should be stored.
For end-users of tskit, it provides documentation that allows them to inspect and interpret the provenance information stored in .trees files.

Provenance information is encoded using JSON. To standardise the provenance information produced by different software and improve interoperability, we define a formal specification using JSON Schema. The full schema, which may be used to automatically validate input, is provided below. The following sections describe the intent of each part of a provenance record in more detail.

This document defines specification version 1.0.0. Specification version numbers follow SemVer semantics.

Example#

To make things more concrete, let’s consider an example:

{
  "schema_version": "1.0.0",
  "software": {
    "name": "msprime",
    "version": "0.6.1.dev123+ga252341.d20180820"
  },
  "parameters": {
    "sample_size": 5,
    "random_seed": 12345,
    "command": "simulate"
  },
  "environment": {
    "libraries": {
      "gsl": {
        "version": "2.1"
      },
      "kastore": {
        "version": "0.1.0"
      }
    },
    "python": {
      "version": "3.5.2",
      "implementation": "CPython"
    },
    "os": {
      "system": "Linux",
      "node": "powderfinger",
      "release": "4.15.0-29-generic",
      "version": "#31~16.04.1-Ubuntu SMP Wed Jul 18 08:54:04 UTC 2018",
      "machine": "x86_64"
    }
  },
  "resources": {
    "elapsed_time": 12.34,
    "user_time": 10.56,
    "sys_time": 1.78,
    "max_memory": 1048576
  }
}

This information records the provenance for a very simple msprime simulation. Besides the required “schema_version” field, the record is a JSON object with three mandatory fields (“software”, “parameters” and “environment”) and one optional field (“resources”), which we discuss separately in the following sections.
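For end-users, a record like the one above can be read with any JSON parser. The following is a minimal sketch using only the Python standard library. The `record_text` string is an abbreviated copy of the example, and the variable names are illustrative assumptions; in practice the text would come from the provenance stored with a tree sequence.

```python
import json

# Abbreviated copy of the example record above, as it would be stored as text.
record_text = """
{
  "schema_version": "1.0.0",
  "software": {"name": "msprime", "version": "0.6.1.dev123+ga252341.d20180820"},
  "parameters": {"sample_size": 5, "random_seed": 12345, "command": "simulate"},
  "environment": {"os": {"system": "Linux", "machine": "x86_64"}},
  "resources": {"elapsed_time": 12.34, "max_memory": 1048576}
}
"""

record = json.loads(record_text)

# The mandatory sections can be indexed directly.
software = record["software"]
summary = f'{software["name"]} {software["version"]}'
command = record["parameters"]["command"]

# "resources" is optional, so fall back to an empty object when it is absent.
elapsed = record.get("resources", {}).get("elapsed_time")

print(summary, command, elapsed)
```

Reading optional sections with `get` keeps the same code working for records that omit “resources”.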

Software#

Every tree sequence is produced by some piece of software. For example, this may be a coalescent simulation produced by msprime, a forwards-time simulation from SLiM or a tree sequence inferred from data by tsinfer. The software provenance is intended to capture the details about this primary software.

Field	Type	Description
name	string	The name of the software.
version	string	The software version.

Note that libraries that the primary software links against are considered part of the Environment and should be recorded there.

Parameters#

The parameters section of a provenance document records the input that was used to produce a particular tree sequence. There are no requirements on what may be stored within it, but we make some recommendations here on how to encode such information.

As a general principle, sufficient information should be recorded in the parameters section to allow the output tree sequence to be reproduced exactly. There will be instances, however, where this is not possible due to missing files, issues with numerical precision and so on.

API invocations#

Consider an API call like the following simple msprime simulation:

ts = msprime.simulate(sample_size=10, recombination_rate=2)

We recommend encoding the parameters provenance as follows (other fields omitted for clarity):

{
  "parameters": {
    "command": "simulate",
    "sample_size": 10,
    "recombination_rate": 2,
    "random_seed": 123456789
  }
}

Specifically, we encode the name of the function using the command key and the function parameters in the obvious way. Note that we include the random_seed here even though it was automatically generated.
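One way a library might assemble such a parameters object is sketched below. The helper name `api_parameters` and its signature are assumptions made for illustration; they are not part of msprime or tskit. The point shown is that an automatically generated seed is stored alongside the explicit arguments.

```python
import json
import random


def api_parameters(command, random_seed=None, **kwargs):
    """Build a "parameters" object for an API call (hypothetical helper)."""
    if random_seed is None:
        # Record the generated seed so that the run can be repeated later.
        random_seed = random.randint(1, 2**32 - 1)
    # The function name goes under "command"; keyword arguments are stored as-is.
    return {"command": command, **kwargs, "random_seed": random_seed}


parameters = api_parameters("simulate", sample_size=10, recombination_rate=2)
print(json.dumps({"parameters": parameters}, indent=2))
```

Passing `random_seed=42` explicitly would store that value unchanged, so the same helper serves both seeded and unseeded calls.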

CLI invocations#

Consider the following invocation of a hypothetical command line program:

$ supersim --sample-size=10 --do-some-stuff -O out.trees

We recommend encoding the parameters provenance as follows (other fields omitted for clarity):

{
  "parameters": {
    "command": "supersim",
    "args": ["--sample-size=10", "--do-some-stuff", "-O", "out.trees"],
    "random_seed": 56789
  }
}

Here we encode the name of the program using the command key and its command line arguments as a list of strings in the args key. We also include the automatically generated random seed in the parameters list.

If parameters that affect the output tree sequence are derived from environment variables, these should also be recorded.
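The CLI recommendation can be sketched as follows for a program written in Python. The helper `cli_parameters`, the `env_names` argument and the `env` key are assumptions for illustration only; the specification does not prescribe where environment variables are stored.

```python
import os


def cli_parameters(argv, random_seed, env_names=()):
    """Build a "parameters" object for a command line run (hypothetical helper)."""
    parameters = {
        "command": os.path.basename(argv[0]),
        "args": list(argv[1:]),
        "random_seed": random_seed,
    }
    # Keep only the named environment variables that are actually set.
    env = {name: os.environ[name] for name in env_names if name in os.environ}
    if env:
        # The "env" key is an assumption; the specification does not name one.
        parameters["env"] = env
    return parameters


argv = ["/usr/bin/supersim", "--sample-size=10", "--do-some-stuff", "-O", "out.trees"]
parameters = cli_parameters(argv, random_seed=56789)
print(parameters)
```

In a real program `argv` would normally be `sys.argv`; a fixed list is used here so that the sketch reproduces the example above.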

Environment#

The environment section captures details about the computational environment in which the software was executed. Two optional fields are defined: os and libraries. We recommend including any additional relevant platform information here; for example, if using Python, store the interpreter information as shown in the example above.

Operating system#

The os section records details about the operating system on which the software was executed. This section is optional and has no required internal structure. We recommend the following structure based on the output of the POSIX uname function:

{
  "environment": {
    "os": {
      "system": "Linux",
      "node": "powderfinger",
      "release": "4.15.0-29-generic",
      "version": "#31~16.04.1-Ubuntu SMP Wed Jul 18 08:54:04 UTC 2018",
      "machine": "x86_64"
    }
  }
}
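In Python, the recommended os structure maps directly onto the fields returned by `platform.uname()` in the standard library. The sketch below also records the interpreter, following the earlier example record; the variable names are illustrative.

```python
import platform

# platform.uname() exposes the same fields as the POSIX uname function.
uname = platform.uname()

environment = {
    "os": {
        "system": uname.system,
        "node": uname.node,
        "release": uname.release,
        "version": uname.version,
        "machine": uname.machine,
    },
    # Interpreter details, as in the "python" entry of the example record.
    "python": {
        "version": platform.python_version(),
        "implementation": platform.python_implementation(),
    },
}

print(environment)
```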

Libraries#

The libraries section captures information about important libraries that the primary software links against. There is no required structure.
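Because the structure is free-form, the name-to-version layout used in the example record is only one option. As a rough sketch for Python software, versions of installed Python packages can be looked up with `importlib.metadata`. Note that this covers Python distributions only, not C libraries such as GSL, and that the helper name `library_versions` is an assumption.

```python
from importlib import metadata


def library_versions(names):
    """Map each installed distribution name to its version (hypothetical helper)."""
    libraries = {}
    for name in names:
        try:
            libraries[name] = {"version": metadata.version(name)}
        except metadata.PackageNotFoundError:
            # Skip anything that is not installed rather than guessing a version.
            continue
    return libraries


print({"libraries": library_versions(["kastore", "numpy"])})
```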

Resources#

The resources section captures details about the computational resources used during the execution of the software. This section is optional and has the following fields, each of which is also optional and may be left unset depending on operating system support:

elapsed_time: The total elapsed time in seconds.
user_time: The total user CPU time in seconds.
sys_time: The total system CPU time in seconds.
max_memory: The maximum memory usage in bytes.

Including this information makes it easy for users of tree-sequence-producing software to account for resource usage across pipelines of tools.
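The sketch below shows one way these fields might be gathered in Python using only the standard library. The helper `measure_resources` is hypothetical. The `resource` module is available on Unix only, which is why `max_memory` is filled in conditionally, and the value it reports is the peak for the whole process rather than for the measured call alone.

```python
import os
import sys
import time

try:
    import resource  # Unix only
except ImportError:
    resource = None


def measure_resources(func, *args, **kwargs):
    """Run func and return (result, resources) (hypothetical helper)."""
    start_wall = time.perf_counter()
    start_cpu = os.times()
    result = func(*args, **kwargs)
    end_cpu = os.times()
    resources = {
        "elapsed_time": time.perf_counter() - start_wall,
        "user_time": end_cpu.user - start_cpu.user,
        "sys_time": end_cpu.system - start_cpu.system,
    }
    if resource is not None:
        peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
        # ru_maxrss is reported in bytes on macOS and in kilobytes on Linux.
        resources["max_memory"] = peak if sys.platform == "darwin" else peak * 1024
    return result, resources


result, resources = measure_resources(sum, range(1_000_000))
print(resources)
```

Leaving `max_memory` out when it cannot be measured is consistent with every field in this section being optional.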

Full schema#

This schema is formally defined using JSON Schema and given in full here. Developers writing provenance information to .trees files should validate the output JSON against this schema.

{
  "schema": "http://json-schema.org/draft-07/schema#",
  "version": "1.1.0",
  "title": "tskit provenance",
  "description": "The combination of software, parameters and environment that produced a tree sequence",
  "type": "object",
  "required": ["schema_version", "software", "parameters", "environment"],
  "properties": {
    "schema_version": {
      "description": "The version of this schema used.",
      "type": "string",
      "minLength": 1
    },
    "software": {
      "description": "The primary software used to produce the tree sequence.",
      "type": "object",
      "required": ["name", "version"],
      "properties": {
        "name": {
          "description": "The name of the primary software.",
          "type": "string",
          "minLength": 1
        },
        "version": {
          "description": "The version of the primary software.",
          "type": "string",
          "minLength": 1
        }
      }
    },
    "parameters": {
      "description": "The parameters used to produce the tree sequence.",
      "type": "object"
    },
    "environment": {
      "description": "The computational environment within which the primary software ran.",
      "type": "object",
      "properties": {
        "os": {
          "description": "Operating system.",
          "type": "object"
        },
        "libraries": {
          "description": "Details of libraries the primary software linked against.",
          "type": "object"
        }
      }
    },
    "resources": {
      "description": "Resources used by this operation.",
      "type": "object",
      "properties": {
        "elapsed_time": {
          "description": "Wall clock time used in seconds.",
          "type": "number"
        },
        "user_time": {
          "description": "User time used in seconds.",
          "type": "number"
        },
        "sys_time": {
          "description": "System time used in seconds.",
          "type": "number"
        },
        "max_memory": {
          "description": "Maximum memory used in bytes.",
          "type": "number"
        }
      }
    }
  }
}
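To illustrate what the schema above requires, the following sketch checks a subset of its constraints by hand using plain Python. It is not a replacement for a real JSON Schema validator, which remains the authoritative check; the helper name `check_provenance` and its messages are assumptions.

```python
def check_provenance(record):
    """Return a list of problems found in a record (hypothetical helper)."""
    if not isinstance(record, dict):
        return ["record is not a JSON object"]
    problems = []
    # Top-level "required" list from the schema.
    for key in ("schema_version", "software", "parameters", "environment"):
        if key not in record:
            problems.append(f"missing required field: {key}")
    version = record.get("schema_version")
    if "schema_version" in record and not (isinstance(version, str) and version):
        problems.append("schema_version must be a non-empty string")
    # Each section, when present, must be a JSON object.
    for key in ("software", "parameters", "environment", "resources"):
        if key in record and not isinstance(record[key], dict):
            problems.append(f"{key} must be an object")
    # "software" requires non-empty "name" and "version" strings.
    software = record.get("software")
    if isinstance(software, dict):
        for key in ("name", "version"):
            value = software.get(key)
            if not (isinstance(value, str) and value):
                problems.append(f"software.{key} must be a non-empty string")
    # Resource fields are optional but must be numbers when present.
    resources = record.get("resources")
    if isinstance(resources, dict):
        for key in ("elapsed_time", "user_time", "sys_time", "max_memory"):
            value = resources.get(key)
            is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
            if key in resources and not is_number:
                problems.append(f"resources.{key} must be a number")
    return problems


record = {
    "schema_version": "1.0.0",
    "software": {"name": "msprime", "version": "0.6.1"},
    "parameters": {"command": "simulate"},
    "environment": {},
}
print(check_provenance(record))
```

Returning a list of problems rather than raising on the first one makes it easier to report everything that is wrong with a record in a single pass.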