Python API#

This page provides formal documentation for the tsdate Python API.

Running tsdate#

tsdate.date(tree_sequence, *, mutation_rate, recombination_rate=None, time_units=None, method=None, constr_iterations=None, return_posteriors=None, return_likelihood=None, progress=None, record_provenance=True, **kwargs)[source]#

Infer dates for nodes in a genealogical graph (or ARG) stored in the succinct tree sequence format. New times are assigned to nodes using the estimation algorithm specified by method (see note below). If a mutation_rate is given, the mutation clock is used. The recombination clock is unsupported at this time. If neither a mutation_rate nor a recombination_rate is given, a topology-only clock is used. Times associated with mutations and times associated with non-fixed (non-sample) nodes are overwritten. For example:

mu = 1e-8
Ne = ts.diversity()/4/mu  #  In the absence of external info, use ts for prior Ne
new_ts = tsdate.date(ts, mutation_rate=mu, population_size=Ne)

Note

This is a wrapper for the named functions that are listed in estimation_methods. Details and specific parameters for each estimation method are given in the documentation for those functions.

Parameters:
  • tree_sequence (TreeSequence) – The input tree sequence to be dated (for example one with uncalibrated node times).

  • mutation_rate (float) – The estimated mutation rate per unit of genome per unit time (see individual methods)

  • recombination_rate (float) – The estimated recombination rate per unit of genome per unit time. If provided, the dating algorithm will use a recombination rate clock to help estimate node dates. Default: None (not currently implemented)

  • time_units (str) – The time units used by the mutation_rate and recombination_rate values, and stored in the time_units attribute of the output tree sequence. If the conditional coalescent prior is used, then this is also applies to the value of population_size, which in standard coalescent theory is measured in generations. Therefore if you wish to use mutation and recombination rates measured in (say) years, and are using the conditional coalescent prior, the population_size value which you provide must be scaled by multiplying by the number of years per generation. If None (default), assume "generations".

  • method (string) – What estimation method to use. See estimation_methods for possible values. If None (default) the “variational_gamma” method is currently chosen.

  • return_posteriors (bool) – If True, instead of returning just a dated tree sequence, return a tuple of (dated_ts, posteriors). Default: None, treated as False.

  • constr_iterations (int) – The maximum number of constrained least squares iterations to use prior to forcing positive branch lengths. Default: None, treated as 0.

  • return_likelihood (bool) – If True, return the log marginal likelihood from the inside algorithm in addition to the dated tree sequence. If return_posteriors is also True, then the marginal likelihood will be the last element of the tuple. Default: None, treated as False.

  • progress (bool) – Show a progress bar. Default: None, treated as False.

  • record_provenance (bool) – Should the tsdate command be appended to the provenence information in the returned tree sequence? Default: None, treated as True.

  • **kwargs – Other keyword arguments specific to the estimation method used. These are documented in those specific functions.

Returns:

A copy of the input tree sequence but with updated node times, or (if return_posteriors or return_likelihood is True) a tuple of that tree sequence plus a dictionary of posterior probabilities and/or the marginal likelihood given the mutations on the tree sequence.

tsdate.core.estimation_methods#

The names of available estimation methods, each mapped to a function to carry out the appropriate method. Names can be passed as strings to the date() function, or each named function can be called directly:

tsdate.variational_gamma(tree_sequence, *, mutation_rate, eps=None, max_iterations=None, rescaling_intervals=None, **kwargs)[source]#

Infer dates for nodes in a tree sequence using expectation propagation, which approximates the marginal posterior distribution of a given node’s age with a gamma distribution. Convergence to the correct posterior moments is obtained by updating the distributions for node dates using several rounds of iteration. For example:

new_ts = tsdate.variational_gamma(ts, mutation_rate=1e-8, max_iterations=10)

A piecewise-constant uniform distribution is used as a prior for each node, that is updated via expectation maximization in each iteration. Node-specific priors are not currently supported.

Parameters:
  • tree_sequence (TreeSequence) – The input tree sequence to be dated.

  • mutation_rate (float) – The estimated mutation rate per unit of genome per unit time.

  • eps (float) – The minimum distance separating parent and child ages in the returned tree sequence. Default: None, treated as 1e-6

  • max_iterations (int) – The number of iterations used in the expectation propagation algorithm. Default: None, treated as 10.

  • rescaling_intervals (float) – For time rescaling, the number of time intervals within which to estimate a rescaling parameter. Setting this to zero means that rescaling is not performed. Default None, treated as 1000.

  • rescaling_iterations (float) – The number of iterations for time rescaling. Setting this to zero means that rescaling is not performed. Default None, treated as 5.

  • match_segregating_sites (bool) – If True, then time is rescaled such that branch- and site-mode segregating sites are approximately equal. If False, time is rescaled such that branch- and site-mode root-to-leaf length are approximately equal, which gives unbiased estimates when there are polytomies. Default False.

  • **kwargs – Other keyword arguments as described in the date() wrapper function, including time_units, progress, and record_provenance. The arguments return_posteriors and return_likelihood can be used to return additional information (see below).

Returns:

  • ts (TreeSequence) – a copy of the input tree sequence with updated node times based on the posterior mean, corrected where necessary to ensure that parents are strictly older than all their children by an amount given by the eps parameter.

  • posteriors (dict) – (Only returned if return_posteriors is True) A dictionary of posterior probabilities. Each node whose time was inferred corresponds to an item in this dictionary whose key is the node ID and value is an array of the [shape, rate] parameters of the posterior gamma distribution for that node. When read as a pandas DataFrame object using pd.DataFrame(posteriors), the first row of the data frame is the shape and the second the rate parameter, each column being headed by the respective node ID.

  • marginal_likelihood (float) – (Only returned if return_likelihood is True) The marginal likelihood of the mutation data given the inferred node times. Not currently implemented for this method (set to None)

tsdate.inside_outside(tree_sequence, *, mutation_rate, population_size=None, priors=None, eps=None, num_threads=None, outside_standardize=None, ignore_oldest_root=None, probability_space=None, **kwargs)[source]#

Infer dates for nodes in a genealogical graph using the “inside outside” algorithm. This approximates the marginal posterior distribution of a node’s age using an atomic discretization of time (e.g. point masses at particular timepoints).

Currently, this estimation method comprises a single “inside” followed by a similar “outside” step. The inside step passes backwards in time from the samples to the roots of the graph,taking account of the distributions of times of each node’s child (and if a mutation_rate is given, the the number of mutations on each edge). The outside step passes forwards in time from the roots, incorporating the time distributions for each node’s parents. If there are (undirected) cycles in the underlying graph, this method does not provide a theoretically exact estimate of the marginal posterior distribution of node ages, but in practice it results in an accurate approximation.

For example:

new_ts = tsdate.inside_outside(ts, mutation_rate=1e-8, population_size=1e4)

Note

The prior parameters for each node-to-be-dated take the form of probabilities for each node at a set of discrete timepoints. If the priors parameter is used, it must specify an object constructed using build_prior_grid() (this can be used to define the number and position of the timepoints). If priors is not used, population_size must be provided, which is used to create a default prior derived from the conditional coalescent (tilted according to population size and weighted by the genomic span over which a node has a given number of descendant samples). This default prior assumes the nodes to be dated are all the non-sample nodes in the input tree sequence, and that they are contemporaneous.

Parameters:
  • tree_sequence (TreeSequence) – The input tree sequence to be dated.

  • mutation_rate (float) – The estimated mutation rate per unit of genome per unit time. If provided, the dating algorithm will use a mutation rate clock to help estimate node dates. Default: None

  • population_size (float or PopulationSizeHistory) – The estimated (diploid) effective population size used to construct the (default) conditional coalescent prior. For a population with constant size, this can be given as a single value (for example, as commonly estimated by the observed genetic diversity of the sample divided by four-times the expected mutation rate). Alternatively, for a population with time-varying size, this can be given directly as a PopulationSizeHistory object or a parameter dictionary passed to initialise a PopulationSizeHistory object. The population_size parameter is only used when priors is None. Conversely, if priors is not None, no population_size value should be specified.

  • priors (tsdate.base.NodeGridValues) – NodeGridValues object containing the prior parameters for each node-to-be-dated. Note that different estimation methods may require different types of prior, as described in the documentation for each estimation method.

  • eps (float) – The error factor in time difference calculations, and the minimum distance separating parent and child ages in the returned tree sequence. Default: None, treated as 1e-6.

  • num_threads (int) – The number of threads to use when precalculating likelihoods. A simpler unthreaded algorithm is used unless this is >= 1. Default: None

  • outside_standardize (bool) – Should the likelihoods be standardized during the outside step? This can help to avoid numerical under/overflow. Using unstandardized values is mostly useful for testing (e.g. to obtain, in the outside step, the total functional value for each node). Default: None, treated as True.

  • ignore_oldest_root (bool) – Should the oldest root in the tree sequence be ignored in the outside algorithm (if "inside_outside" is used as the method). Ignoring outside root can provide greater stability when dating tree sequences inferred from real data, in particular if all local trees are assumed to coalesce in a single “grand MRCA”, as in older versions of tsinfer. Default: None, treated as False.

  • probability_space (string) – Should the internal algorithm save probabilities in “logarithmic” (slower, less liable to to overflow) or “linear” space (fast, may overflow). Default: “logarithmic”

  • **kwargs – Other keyword arguments as described in the date() wrapper function, notably mutation_rate, and population_size or priors. Further arguments include time_units, progress, and record_provenance. The additional arguments return_posteriors and return_likelihood can be used to return additional information (see below).

Returns:

  • ts (TreeSequence) – a copy of the input tree sequence with updated node times based on the posterior mean, corrected where necessary to ensure that parents are strictly older than all their children by an amount given by the eps parameter.

  • posteriors (dict) – (Only returned if return_posteriors is True) A dictionary of posterior probabilities. Each node whose time was inferred corresponds to an item in this dictionary whose key is the node ID and value is an array of probabilities of the node being at a list of timepoints. Timepoint values are provided in the returned dictionary under the key named “time”. When read as a pandas DataFrame object using pd.DataFrame(posteriors), the rows correspond to labelled timepoints and columns are headed by their respective node ID.

  • marginal_likelihood (float) – (Only returned if return_likelihood is True) The marginal likelihood of the mutation data given the inferred node times.

tsdate.maximization(tree_sequence, *, mutation_rate, population_size=None, priors=None, eps=None, num_threads=None, probability_space=None, **kwargs)[source]#

Infer dates for nodes in a genealogical graph using the “outside maximization” algorithm. This approximates the marginal posterior distribution of a node’s age using an atomic discretization of time (e.g. point masses at particular timepoints).

This estimation method comprises a single “inside” step followed by an “outside maximization” step. The inside step passes backwards in time from the samples to the roots of the graph,taking account of the distributions of times of each node’s child (and if a mutation_rate is given, the the number of mutations on each edge). The outside maximization step passes forwards in time from the roots, updating each node’s time on the basis of the most likely timepoint for each parent of that node. This provides a reasonable point estimate for node times, but does not generate a true posterior time distribution.

For example:

new_ts = tsdate.maximization(ts, mutation_rate=1e-8, population_size=1e4)

Note

The prior parameters for each node-to-be-dated take the form of probabilities for each node at a set of discrete timepoints. If the priors parameter is used, it must specify an object constructed using build_prior_grid() (this can be used to define the number and position of the timepoints). If priors is not used, population_size must be provided, which is used to create a default prior derived from the conditional coalescent (tilted according to population size and weighted by the genomic span over which a node has a given number of descendant samples). This default prior assumes the nodes to be dated are all the non-sample nodes in the input tree sequence, and that they are contemporaneous.

Parameters:
  • tree_sequence (TreeSequence) – The input tree sequence to be dated.

  • mutation_rate (float) – The estimated mutation rate per unit of genome per unit time. If provided, the dating algorithm will use a mutation rate clock to help estimate node dates. Default: None

  • population_size (float or PopulationSizeHistory) – The estimated (diploid) effective population size used to construct the (default) conditional coalescent prior. For a population with constant size, this can be given as a single value (for example, as commonly estimated by the observed genetic diversity of the sample divided by four-times the expected mutation rate). Alternatively, for a population with time-varying size, this can be given directly as a PopulationSizeHistory object or a parameter dictionary passed to initialise a PopulationSizeHistory object. The population_size parameter is only used when priors is None. Conversely, if priors is not None, no population_size value should be specified.

  • priors (tsdate.base.NodeGridValues) – NodeGridValues object containing the prior parameters for each node-to-be-dated. Note that different estimation methods may require different types of prior, as described in the documentation for each estimation method.

  • eps (float) – The error factor in time difference calculations, and the minimum distance separating parent and child ages in the returned tree sequence. Default: None, treated as 1e-6.

  • num_threads (int) – The number of threads to use when precalculating likelihoods. A simpler unthreaded algorithm is used unless this is >= 1. Default: None

  • probability_space (string) – Should the internal algorithm save probabilities in “logarithmic” (slower, less liable to to overflow) or “linear” space (fast, may overflow). Default: None treated as”logarithmic”

  • **kwargs – Other keyword arguments as described in the date() wrapper function, notably mutation_rate, and population_size or priors. Further arguments include time_units, progress, and record_provenance. The additional return_likelihood argument can be used to return additional information (see below). Posteriors cannot be returned using this estimation method.

Returns:

  • ts (TreeSequence) – a copy of the input tree sequence with updated node times based on the posterior mean, corrected where necessary to ensure that parents are strictly older than all their children by an amount given by the eps parameter.

  • marginal_likelihood (float) – (Only returned if return_likelihood is True) The marginal likelihood of the mutation data given the inferred node times.

Prior and Time Discretisation Options#

tsdate.build_prior_grid(tree_sequence, population_size, timepoints=20, *, approximate_priors=False, approx_prior_size=None, prior_distribution='lognorm', progress=False, allow_unary=False)#

Using the conditional coalescent, calculate the prior distribution for the age of each node, given the number of contemporaneous samples below it, and the discretised time slices at which to evaluate node age.

Parameters:
  • tree_sequence (tskit.TreeSequence) – The input tskit.TreeSequence, treated as undated.

  • population_size (float or demography.PopulationSizeHistory) – The estimated (diploid) effective population size used to construct the prior. For a population with constant size, this can be given as a single value. For a population with time-varying size, this can be given directly as a PopulationSizeHistory object or a parameter dictionary passed to initialise a PopulationSizeHistory object. Using standard (unscaled) values for population_size results in a prior where times are measured in generations.

  • timepoints (int or array_like) – The number of quantiles used to create the time slices, or manually-specified time slices as a numpy array. Default: 20

  • approximate_priors (bool) – Whether to use a precalculated approximation to the treewise conditional coalescent prior if there are large numbers of sample tips. If an approximate prior has not been precalculated, tsdate will do so and cache the result. Default: False

  • approx_prior_size (int) – Number of samples above which a precalculated prior is used. Only valid if approximate_priors is True. Default: None, treated as DEFAULT_APPROX_PRIOR_SIZE if approximate_priors is True.

  • prior_distr (string) – What distribution to use to approximate the conditional coalescent prior. Can be “lognorm” for the lognormal distribution (generally a better fit, but slightly slower to calculate) or “gamma” for the gamma distribution (slightly faster, but a poorer fit for recent nodes). Default: “lognorm”

Returns:

A prior object to pass to date() and similar functions containing prior values for inference and a discretised time grid

Return type:

base.NodeGridValues

tsdate.build_parameter_grid(tree_sequence, population_size, *, approximate_priors=False, approx_prior_size=None, progress=False, allow_unary=False)#

Using the conditional coalescent, calculate the prior distribution for the age of each node, given the number of contemporaneous samples below it, and return parameters (shape and rate of gamma) in a grid

Parameters:
  • tree_sequence (tskit.TreeSequence) – The input tree sequence, treated as undated.

  • population_size (float) – The estimated (diploid) effective population size: must be specified. May be a single value, or a two-column array with epoch breakpoints and effective population sizes. Using standard (unscaled) values for population_size results in a prior where times are measured in generations.

  • approximate_priors (bool) – Whether to use a precalculated approximation to the treewise conditional coalescent prior if there are large numbers of sample tips. If an approximate prior has not been precalculated, tsdate will do so and cache the result. Default: False

  • approx_prior_size (int) – Number of samples above which a precalculated prior is used. Only valid if approximate_priors is True. Default: None, treated as DEFAULT_APPROX_PRIOR_SIZE if approximate_priors is True.

Return type:

base.NodeGridValues

class tsdate.base.NodeGridValues(num_nodes, nonfixed_nodes, timepoints, fill_value=nan, dtype=<class 'numpy.float64'>)[source]#

A class to store times or discretised distributions of times for node ids. For nodes with fixed times, only a single time value needs to be stored. For non-fixed nodes, an array of len(timepoints) probabilies is required.

Note

This class is not intended to be used directly by users and may be subject to change of name or internal structure in future versions. For details on how to create a NodeGridValues object to be used as a prior, see More on priors (old).

Variables:
  • num_nodes (int) – The number of nodes that will be stored in this object

  • nonfixed_nodes (numpy.ndarray) – a (possibly empty) numpy array of unique positive node ids each of which must be less than num_nodes. Each will have an array of grid_size associated with it. All others (up to num_nodes) will be associated with a single scalar value instead.

  • timepoints (numpy.ndarray) – Array of time points

  • fill_value (float) – What should we fill the data arrays with to start with

tsdate.base.DEFAULT_APPROX_PRIOR_SIZE = 10000#

The default value for approx_prior_size (see build_prior_grid() and build_parameter_grid())

Variable population sizes#

class tsdate.demography.PopulationSizeHistory(population_size, time_breaks=None)[source]#

Stores a piecewise constant population size history and tranforms time from a natural (generational) scale to a coalescent one.

Preprocessing Tree Sequences#

tsdate.preprocess_ts(tree_sequence, *, minimum_gap=None, remove_telomeres=None, delete_intervals=None, split_disjoint=None, filter_populations=False, filter_individuals=False, filter_sites=False, record_provenance=None, **kwargs)[source]#

Function to prepare tree sequences for dating by modifying the tree sequence to increase the accuracy of dating. This can involve removing data-poor regions, removing locally-unary segments of nodes via simplification, and splitting discontinuous nodes.

Parameters:
  • tree_sequence (tskit.TreeSequence) – The input tree sequence to be preprocessed.

  • minimum_gap (float) – The minimum gap between sites to remove from the tree sequence. Default: None treated as 1000000. Removed regions are recorded in the provenance of the resulting tree sequence.

  • remove_telomeres (bool) – Should all material before the first site and after the last site be removed, regardless of the length. Default: None treated as True

  • delete_intervals (array_like) – A list (start, end) pairs describing the genomic intervals (gaps) to delete. This is usually left as None (the default) in which case minimum_gap and remove_telomeres are used to determine the gaps to remove, and the calculated intervals are recorded in the provenance of the resulting tree sequence.

  • split_disjoint (bool) – Run the {func}`split_disjoint_nodes` function on the returned tree sequence, breaking any disjoint node into nodes that can be dated separately (Default: None treated as True).

  • filter_populations (bool) – parameter passed to the {meth}`tskit.TreeSequence.simplify` command. Unlike calling that command directly, this defaults to False, such that all populations in the tree sequence are kept.

  • filter_individuals (bool) – parameter passed to the {meth}`tskit.TreeSequence.simplify` command. Unlike calling that command directly, this defaults to False, such that all individuals in the tree sequence are kept.

  • filter_sites (bool) – parameter passed to the {meth}`tskit.TreeSequence.simplify` command. Unlike calling that command directly, this defaults to False, such that all sites in the tree sequence are kept.

  • record_provenance (bool) – If True, record details of this call to simplify in the returned tree sequence’s provenance information (Default: None treated as True).

  • **kwargs – All further keyword arguments are passed to the {meth}`tskit.TreeSequence.simplify` command.

Returns:

A tree sequence with gaps removed.

Return type:

tskit.TreeSequence

tsdate.util.split_disjoint_nodes(ts, *, record_provenance=None)[source]#

For each non-sample node, split regions separated by gaps into distinct nodes, returning a tree sequence with potentially duplicated nodes.

Where there are multiple disconnected regions, the leftmost one is assigned the ID of the original node, and the remainder are assigned new node IDs. Population, flags, individual, time, and metadata are all copied into the new nodes. Nodes that have been split will be flagged with tsdate.NODE_SPLIT_BY_PREPROCESS. The metadata of these nodes will also be updated with an unsplit_node_id field giving the node ID in the input tree sequence to which they correspond. If this metadata cannot be set, a warning is emitted.

Parameters:

record_provenance (bool) – If True, record details of this call in the returned tree sequence’s provenance information (Default: None treated as True).

Functions for Inferring Tree Sequences with Historical Samples#

tsdate.sites_time_from_ts(tree_sequence, *, unconstrained=True, node_selection='child', min_time=1)[source]#

Returns an estimated “time” for each site. This is the estimated age of the oldest MRCA which possesses a derived variant at that site, and is useful for performing (re)inference of a tree sequence. It is calculated from the ages of nodes, with the appropriate nodes identified by the position of mutations in the trees.

If node times in the tree sequence have been estimated by tsdate using the inside-outside algorithm, then as well as a time in the tree sequence, nodes will store additional time estimates that have not been explictly constrained by the tree topology. By default, this function tries to use these “unconstrained” times, although this is likely to fail (with a warning) on tree sequences that have not been processed by tsdate: in this case the standard node times can be used by setting unconstrained=False.

The concept of a site time is meaningless for non-variable sites, and so the returned time for these sites is np.nan (note that this is not exactly the same as tskit.UNKNOWN_TIME, which marks sites that could have a meaningful time but whose time estimate is unknown).

Parameters:
  • tree_sequence (tskit.TreeSequence) – The input tree sequence.

  • unconstrained (bool) – Use estimated node times which have not been constrained by tree topology. If True (default), this requires a tree sequence which has been dated using the tsdate inside-outside algorithm. If this is not the case, specify False to use the standard tree sequence node times.

  • node_selection (str) –

    Defines how site times are calculated from the age of the upper and lower nodes that bound each mutation at the site. Options are “child”, “parent”, “arithmetic” or “geometric”, with the following meanings

    • 'child' (default): the site time is the age of the oldest node below each mutation at the site

    • 'parent': the site time is the age of the oldest node above each mutation at the site

    • 'arithmetic': the arithmetic mean of the ages of the node above and the node below each mutation is calculated; the site time is the oldest of these means.

    • 'geometric': the geometric mean of the ages of the node above and the node below each mutation is calculated; the site time is the oldest of these means

  • min_time (float) – A site time of zero implies that no MRCA in the past possessed the derived variant, so the variant cannot be used for inferring relationships between the samples. To allow all variants to be potentially available for inference, if a site time would otherwise be calculated as zero (for example, where the mutation_age parameter is “child” or “geometric” and all mutations at a site are associated with leaf nodes), a minimum site greater than 0 is recommended. By default this is set to 1, which is generally reasonable for times measured in generations or years, although it is also fine to set this to a small epsilon value.

Returns:

Array of length tree_sequence.num_sites with estimated time of each site

Return type:

numpy.ndarray(dtype=np.float64)

tsdate.add_sampledata_times(samples, sites_time)[source]#

Return a tsinfer.SampleData file with estimated times associated with sites. Ensures that each site’s time is at least as old as the oldest historic sample carrying a derived allele at that site.

Parameters:

samples (tsinfer.formats.SampleData) – A tsinfer SampleData object to add site times to. Any historic individuals in this SampleData file are used to constrain site times.

Returns:

A tsinfer.SampleData file

Return type:

tsinfer.SampleData

Constants#

tsdate.NODE_IS_HISTORICAL_SAMPLE = 1048576#

Node flag value indicating that this is a non-contemporary sample node

tsdate.NODE_SPLIT_BY_PREPROCESS = 1073741824#

Node flag value indicating that this was a disjoint node that was then split