Python API#
This page provides formal documentation for the tsdate
Python API.
Running tsdate#
- tsdate.date(tree_sequence, mutation_rate, population_size=None, recombination_rate=None, time_units=None, priors=None, method=None, *, constr_iterations=None, return_posteriors=None, return_likelihood=None, progress=None, record_provenance=True, Ne=None, **kwargs)[source]#
Infer dates for nodes in a genealogical graph (or ARG) stored in the succinct tree sequence format. New times are assigned to nodes using the estimation algorithm specified by
method
(see note below). If amutation_rate
is given, the mutation clock is used. The recombination clock is unsupported at this time. If neither amutation_rate
nor arecombination_rate
is given, a topology-only clock is used. Times associated with mutations and times associated with non-fixed (non-sample) nodes are overwritten. For example:mu = 1e-8 Ne = ts.diversity()/4/mu # In the absence of external info, use ts for prior Ne new_ts = tsdate.date(ts, mutation_rate=mu, population_size=Ne)
Note
This is a wrapper for the named functions that are listed in
estimation_methods
. Details and specific parameters for each estimation method are given in the documentation for those functions.- Parameters:
tree_sequence (TreeSequence) – The input tree sequence to be dated (for example one with
uncalibrated
node times).population_size (float or PopulationSizeHistory) – The estimated (diploid) effective population size used to construct the (default) conditional coalescent prior. For a population with constant size, this can be given as a single value (for example, as commonly estimated by the observed genetic diversity of the sample divided by four-times the expected mutation rate). Alternatively, for a population with time-varying size, this can be given directly as a
PopulationSizeHistory
object or a parameter dictionary passed to initialise aPopulationSizeHistory
object. Thepopulation_size
parameter is only used whenpriors
isNone
. Conversely, ifpriors
is notNone
, nopopulation_size
value should be specified.mutation_rate (float) – The estimated mutation rate per unit of genome per unit time. If provided, the dating algorithm will use a mutation rate clock to help estimate node dates. Default:
None
recombination_rate (float) – The estimated recombination rate per unit of genome per unit time. If provided, the dating algorithm will use a recombination rate clock to help estimate node dates. Default:
None
(not currently implemented)time_units (str) – The time units used by the
mutation_rate
andrecombination_rate
values, and stored in thetime_units
attribute of the output tree sequence. If the conditional coalescent prior is used, then this is also applies to the value ofpopulation_size
, which in standard coalescent theory is measured in generations. Therefore if you wish to use mutation and recombination rates measured in (say) years, and are using the conditional coalescent prior, thepopulation_size
value which you provide must be scaled by multiplying by the number of years per generation. IfNone
(default), assume"generations"
.priors (tsdate.base.NodeGridValues) – NodeGridValues object containing the prior parameters for each node-to-be-dated. Note that different estimation methods may require different types of prior, as described in the documentation for each estimation method.
method (string) – What estimation method to use. See
estimation_methods
for possible values. IfNone
(default) the “inside_outside” method is currently chosen.return_posteriors (bool) – If
True
, instead of returning just a dated tree sequence, return a tuple of(dated_ts, posteriors)
. Default: None, treated as False.constr_iterations (int) – The maximum number of constrained least squares iterations to use prior to forcing positive branch lengths. Default: None, treated as 0.
return_likelihood (bool) – If
True
, return the log marginal likelihood from the inside algorithm in addition to the dated tree sequence. Ifreturn_posteriors
is alsoTrue
, then the marginal likelihood will be the last element of the tuple. Default: None, treated as False.progress (bool) – Show a progress bar. Default: None, treated as False.
record_provenance (bool) – Should the tsdate command be appended to the provenence information in the returned tree sequence? Default: None, treated as True.
Ne (float) – Deprecated, use the``population_size`` argument instead.
**kwargs – Other keyword arguments specific to the
estimation method
used. These are documented in those specific functions.
- Returns:
A copy of the input tree sequence but with updated node times, or (if
return_posteriors
orreturn_likelihood
is True) a tuple of that tree sequence plus a dictionary of posterior probabilities and/or the marginal likelihood given the mutations on the tree sequence.
- tsdate.core.estimation_methods#
The names of available estimation methods, each mapped to a function to carry out the appropriate method. Names can be passed as strings to the
date()
function, or each named function can be called directly:tsdate.inside_outside()
: empirically better, theoretically problematic.tsdate.maximization()
: worse empirically, especially with gamma approximated priors, but theoretically robusttsdate.variational_gamma()
: variational approximation, empirically most accurate.
- tsdate.inside_outside(tree_sequence, *, eps=None, num_threads=None, outside_standardize=None, ignore_oldest_root=None, probability_space=None, cache_inside=False, **kwargs)[source]#
Infer dates for nodes in a genealogical graph using the “inside outside” algorithm. This approximates the marginal posterior distribution of a node’s age using an atomic discretization of time (e.g. point masses at particular timepoints).
Currently, this estimation method comprises a single “inside” followed by a similar “outside” step. The inside step passes backwards in time from the samples to the roots of the graph,taking account of the distributions of times of each node’s child (and if a
mutation_rate
is given, the the number of mutations on each edge). The outside step passes forwards in time from the roots, incorporating the time distributions for each node’s parents. If there are (undirected) cycles in the underlying graph, this method does not provide a theoretically exact estimate of the marginal posterior distribution of node ages, but in practice it results in an accurate approximation.For example:
new_ts = tsdate.inside_outside(ts, mutation_rate=1e-8, population_size=1e4)
Note
The prior parameters for each node-to-be-dated take the form of probabilities for each node at a set of discrete timepoints. If the
priors
parameter is used, it must specify an object constructed usingbuild_prior_grid()
(this can be used to define the number and position of the timepoints). Ifpriors
is not used,population_size
must be provided, which is used to create a default prior derived from the conditional coalescent (tilted according to population size and weighted by the genomic span over which a node has a given number of descendant samples). This default prior assumes the nodes to be dated are all the non-sample nodes in the input tree sequence, and that they are contemporaneous.- Parameters:
tree_sequence (TreeSequence) – The input tree sequence to be dated.
eps (float) – The error factor in time difference calculations, and the minimum distance separating parent and child ages in the returned tree sequence. Default: None, treated as 1e-6.
num_threads (int) – The number of threads to use when precalculating likelihoods. A simpler unthreaded algorithm is used unless this is >= 1. Default: None
outside_standardize (bool) – Should the likelihoods be standardized during the outside step? This can help to avoid numerical under/overflow. Using unstandardized values is mostly useful for testing (e.g. to obtain, in the outside step, the total functional value for each node). Default: None, treated as True.
ignore_oldest_root (bool) – Should the oldest root in the tree sequence be ignored in the outside algorithm (if
"inside_outside"
is used as the method). Ignoring outside root can provide greater stability when dating tree sequences inferred from real data, in particular if all local trees are assumed to coalesce in a single “grand MRCA”, as in older versions oftsinfer
. Default: None, treated as False.probability_space (string) – Should the internal algorithm save probabilities in “logarithmic” (slower, less liable to to overflow) or “linear” space (fast, may overflow). Default: “logarithmic”
**kwargs – Other keyword arguments as described in the
date()
wrapper function, notablymutation_rate
, andpopulation_size
orpriors
. Further arguments includetime_units
,progress
, andrecord_provenance
. The additional argumentsreturn_posteriors
andreturn_likelihood
can be used to return additional information (see below).
- Returns:
ts (
TreeSequence
) – a copy of the input tree sequence with updated node times based on the posterior mean, corrected where necessary to ensure that parents are strictly older than all their children by an amount given by theeps
parameter.posteriors (
dict
) – (Only returned ifreturn_posteriors
isTrue
) A dictionary of posterior probabilities. Each node whose time was inferred corresponds to an item in this dictionary whose key is the node ID and value is an array of probabilities of the node being at a list of timepoints. Timepoint values are provided in the returned dictionary under the key named “time”. When read as a pandasDataFrame
object usingpd.DataFrame(posteriors)
, the rows correspond to labelled timepoints and columns are headed by their respective node ID.marginal_likelihood (
float
) – (Only returned ifreturn_likelihood
isTrue
) The marginal likelihood of the mutation data given the inferred node times.
- tsdate.maximization(tree_sequence, *, eps=None, num_threads=None, probability_space=None, cache_inside=None, **kwargs)[source]#
Infer dates for nodes in a genealogical graph using the “outside maximization” algorithm. This approximates the marginal posterior distribution of a node’s age using an atomic discretization of time (e.g. point masses at particular timepoints).
This estimation method comprises a single “inside” step followed by an “outside maximization” step. The inside step passes backwards in time from the samples to the roots of the graph,taking account of the distributions of times of each node’s child (and if a
mutation_rate
is given, the the number of mutations on each edge). The outside maximization step passes forwards in time from the roots, updating each node’s time on the basis of the most likely timepoint for each parent of that node. This provides a reasonable point estimate for node times, but does not generate a true posterior time distribution.For example:
new_ts = tsdate.maximization(ts, mutation_rate=1e-8, population_size=1e4)
Note
The prior parameters for each node-to-be-dated take the form of probabilities for each node at a set of discrete timepoints. If the
priors
parameter is used, it must specify an object constructed usingbuild_prior_grid()
(this can be used to define the number and position of the timepoints). Ifpriors
is not used,population_size
must be provided, which is used to create a default prior derived from the conditional coalescent (tilted according to population size and weighted by the genomic span over which a node has a given number of descendant samples). This default prior assumes the nodes to be dated are all the non-sample nodes in the input tree sequence, and that they are contemporaneous.- Parameters:
tree_sequence (TreeSequence) – The input tree sequence to be dated.
eps (float) – The error factor in time difference calculations, and the minimum distance separating parent and child ages in the returned tree sequence. Default: None, treated as 1e-6.
num_threads (int) – The number of threads to use when precalculating likelihoods. A simpler unthreaded algorithm is used unless this is >= 1. Default: None
probability_space (string) – Should the internal algorithm save probabilities in “logarithmic” (slower, less liable to to overflow) or “linear” space (fast, may overflow). Default: None treated as”logarithmic”
**kwargs – Other keyword arguments as described in the
date()
wrapper function, notablymutation_rate
, andpopulation_size
orpriors
. Further arguments includetime_units
,progress
, andrecord_provenance
. The additionalreturn_likelihood
argument can be used to return additional information (see below). Posteriors cannot be returned using this estimation method.
- Returns:
ts (
TreeSequence
) – a copy of the input tree sequence with updated node times based on the posterior mean, corrected where necessary to ensure that parents are strictly older than all their children by an amount given by theeps
parameter.marginal_likelihood (
float
) – (Only returned ifreturn_likelihood
isTrue
) The marginal likelihood of the mutation data given the inferred node times.
- tsdate.variational_gamma(tree_sequence, *, eps=None, max_iterations=None, max_shape=None, rescaling_intervals=None, match_central_moments=None, match_segregating_sites=None, regularise_roots=None, **kwargs)[source]#
Infer dates for nodes in a tree sequence using expectation propagation, which approximates the marginal posterior distribution of a given node’s age with a gamma distribution. Convergence to the correct posterior moments is obtained by updating the distributions for node dates using several rounds of iteration. For example:
new_ts = tsdate.variational_gamma(ts, mutation_rate=1e-8, max_iterations=10)
An piecewise-constant uniform distribution is used as a prior for each node, that is updated via expectation maximization in each iteration. Node-specific priors are not currently supported.
- Parameters:
tree_sequence (TreeSequence) – The input tree sequence to be dated.
eps (float) – The minimum distance separating parent and child ages in the returned tree sequence. Default: None, treated as 1e-6
max_iterations (int) – The number of iterations used in the expectation propagation algorithm. Default: None, treated as 10.
max_shape (float) – The maximum value for the shape parameter in the variational posteriors. This is equivalent to the maximum precision (inverse variance) on a logarithmic scale. Default: None, treated as 1000.
rescaling_intervals (float) – For time rescaling, the number of time intervals within which to estimate a rescaling parameter. Default None, treated as 1000.
**kwargs – Other keyword arguments as described in the
date()
wrapper function, notablymutation_rate
, andpopulation_size
orpriors
. Further arguments includetime_units
,progress
, andrecord_provenance
. The additional argumentsreturn_posteriors
andreturn_likelihood
can be used to return additional information (see below).
- Returns:
ts (
TreeSequence
) – a copy of the input tree sequence with updated node times based on the posterior mean, corrected where necessary to ensure that parents are strictly older than all their children by an amount given by theeps
parameter.posteriors (
dict
) – (Only returned ifreturn_posteriors
isTrue
) A dictionary of posterior probabilities. Each node whose time was inferred corresponds to an item in this dictionary whose key is the node ID and value is an array of the[shape, rate]
parameters of the posterior gamma distribution for that node. When read as a pandasDataFrame
object usingpd.DataFrame(posteriors)
, the first row of the data frame is the shape and the second the rate parameter, each column being headed by the respective node ID.marginal_likelihood (
float
) – (Only returned ifreturn_likelihood
isTrue
) The marginal likelihood of the mutation data given the inferred node times. Not currently implemented for this method (set toNone
)
Prior and Time Discretisation Options#
- tsdate.build_prior_grid(tree_sequence, population_size, timepoints=20, *, approximate_priors=False, approx_prior_size=None, prior_distribution='lognorm', progress=False, allow_unary=False)#
Using the conditional coalescent, calculate the prior distribution for the age of each node, given the number of contemporaneous samples below it, and the discretised time slices at which to evaluate node age.
- Parameters:
tree_sequence (tskit.TreeSequence) – The input
tskit.TreeSequence
, treated as undated.population_size (float or demography.PopulationSizeHistory) – The estimated (diploid) effective population size used to construct the prior. For a population with constant size, this can be given as a single value. For a population with time-varying size, this can be given directly as a
PopulationSizeHistory
object or a parameter dictionary passed to initialise aPopulationSizeHistory
object. Using standard (unscaled) values forpopulation_size
results in a prior where times are measured in generations.timepoints (int or array_like) – The number of quantiles used to create the time slices, or manually-specified time slices as a numpy array. Default: 20
approximate_priors (bool) – Whether to use a precalculated approximation to the treewise conditional coalescent prior if there are large numbers of sample tips. If an approximate prior has not been precalculated, tsdate will do so and cache the result. Default: False
approx_prior_size (int) – Number of samples above which a precalculated prior is used. Only valid if
approximate_priors
is True. Default:None
, treated asDEFAULT_APPROX_PRIOR_SIZE
ifapproximate_priors
is True.prior_distr (string) – What distribution to use to approximate the conditional coalescent prior. Can be “lognorm” for the lognormal distribution (generally a better fit, but slightly slower to calculate) or “gamma” for the gamma distribution (slightly faster, but a poorer fit for recent nodes). Default: “lognorm”
- Returns:
A prior object to pass to
date()
and similar functions containing prior values for inference and a discretised time grid- Return type:
- tsdate.build_parameter_grid(tree_sequence, population_size, *, approximate_priors=False, approx_prior_size=None, progress=False, allow_unary=False)#
Using the conditional coalescent, calculate the prior distribution for the age of each node, given the number of contemporaneous samples below it, and return parameters (shape and rate of gamma) in a grid
- Parameters:
tree_sequence (tskit.TreeSequence) – The input tree sequence, treated as undated.
population_size (float) – The estimated (diploid) effective population size: must be specified. May be a single value, or a two-column array with epoch breakpoints and effective population sizes. Using standard (unscaled) values for
population_size
results in a prior where times are measured in generations.approximate_priors (bool) – Whether to use a precalculated approximation to the treewise conditional coalescent prior if there are large numbers of sample tips. If an approximate prior has not been precalculated, tsdate will do so and cache the result. Default: False
approx_prior_size (int) – Number of samples above which a precalculated prior is used. Only valid if
approximate_priors
is True. Default:None
, treated asDEFAULT_APPROX_PRIOR_SIZE
ifapproximate_priors
is True.
- Return type:
- class tsdate.base.NodeGridValues(num_nodes, nonfixed_nodes, timepoints, fill_value=nan, dtype=<class 'numpy.float64'>)[source]#
A class to store times or discretised distributions of times for node ids. For nodes with fixed times, only a single time value needs to be stored. For non-fixed nodes, an array of len(timepoints) probabilies is required.
Note
This class is not intended to be used directly by users and may be subject to change of name or internal structure in future versions. For details on how to create a
NodeGridValues
object to be used as a prior, see Specifying a Prior.- Variables:
num_nodes (int) – The number of nodes that will be stored in this object
nonfixed_nodes (numpy.ndarray) – a (possibly empty) numpy array of unique positive node ids each of which must be less than num_nodes. Each will have an array of grid_size associated with it. All others (up to num_nodes) will be associated with a single scalar value instead.
timepoints (numpy.ndarray) – Array of time points
fill_value (float) – What should we fill the data arrays with to start with
- tsdate.base.DEFAULT_APPROX_PRIOR_SIZE = 10000#
The default value for approx_prior_size (see
build_prior_grid()
andbuild_parameter_grid()
)
Variable population sizes#
- class tsdate.demography.PopulationSizeHistory(population_size, time_breaks=None)[source]#
Stores a piecewise constant population size history and tranforms time from a natural (generational) scale to a coalescent one. See Variable population sizes for details.
Preprocessing Tree Sequences#
- tsdate.preprocess_ts(tree_sequence, *, minimum_gap=None, remove_telomeres=None, filter_populations=False, filter_individuals=False, filter_sites=False, delete_intervals=None, **kwargs)[source]#
Function to prepare tree sequences for dating by removing gaps without sites and simplifying the tree sequence. Large regions without data can cause overflow/underflow errors in the inside-outside algorithm and poor performance more generally. Removed regions are recorded in the provenance of the resulting tree sequence.
- Parameters:
tree_sequence (tskit.TreeSequence) – The input tree sequence to be preprocessed.
minimum_gap (float) – The minimum gap between sites to remove from the tree sequence. Default:
None
treated as1000000
remove_telomeres (bool) – Should all material before the first site and after the last site be removed, regardless of the length. Default:
None
treated asTrue
filter_populations (bool) – parameter passed to the
tskit.simplify
command. Unlike calling that command directly, this defaults toFalse
, such that all populations in the tree sequence are kept.filter_individuals (bool) – parameter passed to the
tskit.simplify
command. Unlike calling that command directly, this defaults toFalse
, such that all individuals in the tree sequence are keptfilter_sites (bool) – parameter passed to the
tskit.simplify
command. Unlike calling that command directly, this defaults toFalse
, such that all sites in the tree sequence are keptdelete_intervals (array_like) – A list (start, end) pairs describing the genomic intervals (gaps) to delete. This is usually left as
None
(the default) in which caseminimum_gap
andremove_telomeres
are used to determine the gaps to remove, and the calculated intervals are recorded in the provenance of the resulting tree sequence.**kwargs – All further keyword arguments are passed to the
tskit.simplify
command.
- Returns:
A tree sequence with gaps removed.
- Return type:
Functions for Inferring Tree Sequences with Historical Samples#
- tsdate.sites_time_from_ts(tree_sequence, *, unconstrained=True, node_selection='child', min_time=1)[source]#
Returns an estimated “time” for each site. This is the estimated age of the oldest MRCA which possesses a derived variant at that site, and is useful for performing (re)inference of a tree sequence. It is calculated from the ages of nodes, with the appropriate nodes identified by the position of mutations in the trees.
If node times in the tree sequence have been estimated by
tsdate
using the inside-outside algorithm, then as well as a time in the tree sequence, nodes will store additional time estimates that have not been explictly constrained by the tree topology. By default, this function tries to use these “unconstrained” times, although this is likely to fail (with a warning) on tree sequences that have not been processed bytsdate
: in this case the standard node times can be used by settingunconstrained=False
.The concept of a site time is meaningless for non-variable sites, and so the returned time for these sites is
np.nan
(note that this is not exactly the same as tskit.UNKNOWN_TIME, which marks sites that could have a meaningful time but whose time estimate is unknown).- Parameters:
tree_sequence (tskit.TreeSequence) – The input tree sequence.
unconstrained (bool) – Use estimated node times which have not been constrained by tree topology. If
True
(default), this requires a tree sequence which has been dated using thetsdate
inside-outside algorithm. If this is not the case, specifyFalse
to use the standard tree sequence node times.node_selection (str) –
Defines how site times are calculated from the age of the upper and lower nodes that bound each mutation at the site. Options are “child”, “parent”, “arithmetic” or “geometric”, with the following meanings
'child'
(default): the site time is the age of the oldest node below each mutation at the site'parent'
: the site time is the age of the oldest node above each mutation at the site'arithmetic'
: the arithmetic mean of the ages of the node above and the node below each mutation is calculated; the site time is the oldest of these means.'geometric'
: the geometric mean of the ages of the node above and the node below each mutation is calculated; the site time is the oldest of these means
min_time (float) – A site time of zero implies that no MRCA in the past possessed the derived variant, so the variant cannot be used for inferring relationships between the samples. To allow all variants to be potentially available for inference, if a site time would otherwise be calculated as zero (for example, where the
mutation_age
parameter is “child” or “geometric” and all mutations at a site are associated with leaf nodes), a minimum site greater than 0 is recommended. By default this is set to 1, which is generally reasonable for times measured in generations or years, although it is also fine to set this to a small epsilon value.
- Returns:
Array of length tree_sequence.num_sites with estimated time of each site
- Return type:
numpy.ndarray(dtype=np.float64)
- tsdate.add_sampledata_times(samples, sites_time)[source]#
Return a tsinfer.SampleData file with estimated times associated with sites. Ensures that each site’s time is at least as old as the oldest historic sample carrying a derived allele at that site.
- Parameters:
samples (tsinfer.formats.SampleData) – A tsinfer SampleData object to add site times to. Any historic individuals in this SampleData file are used to constrain site times.
- Returns:
A tsinfer.SampleData file
- Return type: