Python API#
This page provides formal documentation for the tsdate Python API.
Running tsdate#
- tsdate.date(tree_sequence, *, mutation_rate, recombination_rate=None, time_units=None, method=None, constr_iterations=None, return_posteriors=None, return_likelihood=None, progress=None, record_provenance=True, **kwargs)[source]#
Infer dates for nodes in a genealogical graph (or ARG) stored in the succinct tree sequence format. New times are assigned to nodes using the estimation algorithm specified by
method
(see note below). If amutation_rate
is given, the mutation clock is used. The recombination clock is unsupported at this time. If neither amutation_rate
nor arecombination_rate
is given, a topology-only clock is used. Times associated with mutations and times associated with non-fixed (non-sample) nodes are overwritten. For example:mu = 1e-8 Ne = ts.diversity()/4/mu # In the absence of external info, use ts for prior Ne new_ts = tsdate.date(ts, mutation_rate=mu, population_size=Ne)
Note
This is a wrapper for the named functions that are listed in
estimation_methods
. Details and specific parameters for each estimation method are given in the documentation for those functions.- Parameters:
tree_sequence (TreeSequence) – The input tree sequence to be dated (for example one with
uncalibrated
node times).mutation_rate (float) – The estimated mutation rate per unit of genome per unit time (see individual methods)
recombination_rate (float) – The estimated recombination rate per unit of genome per unit time. If provided, the dating algorithm will use a recombination rate clock to help estimate node dates. Default:
None
(not currently implemented)time_units (str) – The time units used by the
mutation_rate
andrecombination_rate
values, and stored in thetime_units
attribute of the output tree sequence. If the conditional coalescent prior is used, then this is also applies to the value ofpopulation_size
, which in standard coalescent theory is measured in generations. Therefore if you wish to use mutation and recombination rates measured in (say) years, and are using the conditional coalescent prior, thepopulation_size
value which you provide must be scaled by multiplying by the number of years per generation. IfNone
(default), assume"generations"
.method (string) – What estimation method to use. See
estimation_methods
for possible values. IfNone
(default) the “variational_gamma” method is currently chosen.return_posteriors (bool) – If
True
, instead of returning just a dated tree sequence, return a tuple of(dated_ts, posteriors)
. Default: None, treated as False.constr_iterations (int) – The maximum number of constrained least squares iterations to use prior to forcing positive branch lengths. Default: None, treated as 0.
return_likelihood (bool) – If
True
, return the log marginal likelihood from the inside algorithm in addition to the dated tree sequence. Ifreturn_posteriors
is alsoTrue
, then the marginal likelihood will be the last element of the tuple. Default: None, treated as False.progress (bool) – Show a progress bar. Default: None, treated as False.
record_provenance (bool) – Should the tsdate command be appended to the provenence information in the returned tree sequence? Default: None, treated as True.
**kwargs – Other keyword arguments specific to the
estimation method
used. These are documented in those specific functions.
- Returns:
A copy of the input tree sequence but with updated node times, or (if
return_posteriors
orreturn_likelihood
is True) a tuple of that tree sequence plus a dictionary of posterior probabilities and/or the marginal likelihood given the mutations on the tree sequence.
- tsdate.core.estimation_methods#
The names of available estimation methods, each mapped to a function to carry out the appropriate method. Names can be passed as strings to the
date()
function, or each named function can be called directly:tsdate.variational_gamma()
: variational approximation, empirically most accurate.tsdate.inside_outside()
: empirically better, theoretically problematic.tsdate.maximization()
: worse empirically, especially with gamma approximated priors, but theoretically robust
- tsdate.variational_gamma(tree_sequence, *, mutation_rate, eps=None, max_iterations=None, rescaling_intervals=None, **kwargs)[source]#
Infer dates for nodes in a tree sequence using expectation propagation, which approximates the marginal posterior distribution of a given node’s age with a gamma distribution. Convergence to the correct posterior moments is obtained by updating the distributions for node dates using several rounds of iteration. For example:
new_ts = tsdate.variational_gamma(ts, mutation_rate=1e-8, max_iterations=10)
A piecewise-constant uniform distribution is used as a prior for each node, that is updated via expectation maximization in each iteration. Node-specific priors are not currently supported.
- Parameters:
tree_sequence (TreeSequence) – The input tree sequence to be dated.
mutation_rate (float) – The estimated mutation rate per unit of genome per unit time.
eps (float) – The minimum distance separating parent and child ages in the returned tree sequence. Default: None, treated as 1e-6
max_iterations (int) – The number of iterations used in the expectation propagation algorithm. Default: None, treated as 10.
rescaling_intervals (float) – For time rescaling, the number of time intervals within which to estimate a rescaling parameter. Setting this to zero means that rescaling is not performed. Default
None
, treated as 1000.rescaling_iterations (float) – The number of iterations for time rescaling. Setting this to zero means that rescaling is not performed. Default
None
, treated as 5.match_segregating_sites (bool) – If
True
, then time is rescaled such that branch- and site-mode segregating sites are approximately equal. IfFalse
, time is rescaled such that branch- and site-mode root-to-leaf length are approximately equal, which gives unbiased estimates when there are polytomies. DefaultFalse
.**kwargs – Other keyword arguments as described in the
date()
wrapper function, includingtime_units
,progress
, andrecord_provenance
. The argumentsreturn_posteriors
andreturn_likelihood
can be used to return additional information (see below).
- Returns:
ts (
TreeSequence
) – a copy of the input tree sequence with updated node times based on the posterior mean, corrected where necessary to ensure that parents are strictly older than all their children by an amount given by theeps
parameter.posteriors (
dict
) – (Only returned ifreturn_posteriors
isTrue
) A dictionary of posterior probabilities. Each node whose time was inferred corresponds to an item in this dictionary whose key is the node ID and value is an array of the[shape, rate]
parameters of the posterior gamma distribution for that node. When read as a pandasDataFrame
object usingpd.DataFrame(posteriors)
, the first row of the data frame is the shape and the second the rate parameter, each column being headed by the respective node ID.marginal_likelihood (
float
) – (Only returned ifreturn_likelihood
isTrue
) The marginal likelihood of the mutation data given the inferred node times. Not currently implemented for this method (set toNone
)
- tsdate.inside_outside(tree_sequence, *, mutation_rate, population_size=None, priors=None, eps=None, num_threads=None, outside_standardize=None, ignore_oldest_root=None, probability_space=None, **kwargs)[source]#
Infer dates for nodes in a genealogical graph using the “inside outside” algorithm. This approximates the marginal posterior distribution of a node’s age using an atomic discretization of time (e.g. point masses at particular timepoints).
Currently, this estimation method comprises a single “inside” followed by a similar “outside” step. The inside step passes backwards in time from the samples to the roots of the graph,taking account of the distributions of times of each node’s child (and if a
mutation_rate
is given, the the number of mutations on each edge). The outside step passes forwards in time from the roots, incorporating the time distributions for each node’s parents. If there are (undirected) cycles in the underlying graph, this method does not provide a theoretically exact estimate of the marginal posterior distribution of node ages, but in practice it results in an accurate approximation.For example:
new_ts = tsdate.inside_outside(ts, mutation_rate=1e-8, population_size=1e4)
Note
The prior parameters for each node-to-be-dated take the form of probabilities for each node at a set of discrete timepoints. If the
priors
parameter is used, it must specify an object constructed usingbuild_prior_grid()
(this can be used to define the number and position of the timepoints). Ifpriors
is not used,population_size
must be provided, which is used to create a default prior derived from the conditional coalescent (tilted according to population size and weighted by the genomic span over which a node has a given number of descendant samples). This default prior assumes the nodes to be dated are all the non-sample nodes in the input tree sequence, and that they are contemporaneous.- Parameters:
tree_sequence (TreeSequence) – The input tree sequence to be dated.
mutation_rate (float) – The estimated mutation rate per unit of genome per unit time. If provided, the dating algorithm will use a mutation rate clock to help estimate node dates. Default:
None
population_size (float or PopulationSizeHistory) – The estimated (diploid) effective population size used to construct the (default) conditional coalescent prior. For a population with constant size, this can be given as a single value (for example, as commonly estimated by the observed genetic diversity of the sample divided by four-times the expected mutation rate). Alternatively, for a population with time-varying size, this can be given directly as a
PopulationSizeHistory
object or a parameter dictionary passed to initialise aPopulationSizeHistory
object. Thepopulation_size
parameter is only used whenpriors
isNone
. Conversely, ifpriors
is notNone
, nopopulation_size
value should be specified.priors (tsdate.base.NodeGridValues) – NodeGridValues object containing the prior parameters for each node-to-be-dated. Note that different estimation methods may require different types of prior, as described in the documentation for each estimation method.
eps (float) – The error factor in time difference calculations, and the minimum distance separating parent and child ages in the returned tree sequence. Default: None, treated as 1e-6.
num_threads (int) – The number of threads to use when precalculating likelihoods. A simpler unthreaded algorithm is used unless this is >= 1. Default: None
outside_standardize (bool) – Should the likelihoods be standardized during the outside step? This can help to avoid numerical under/overflow. Using unstandardized values is mostly useful for testing (e.g. to obtain, in the outside step, the total functional value for each node). Default: None, treated as True.
ignore_oldest_root (bool) – Should the oldest root in the tree sequence be ignored in the outside algorithm (if
"inside_outside"
is used as the method). Ignoring outside root can provide greater stability when dating tree sequences inferred from real data, in particular if all local trees are assumed to coalesce in a single “grand MRCA”, as in older versions oftsinfer
. Default: None, treated as False.probability_space (string) – Should the internal algorithm save probabilities in “logarithmic” (slower, less liable to to overflow) or “linear” space (fast, may overflow). Default: “logarithmic”
**kwargs – Other keyword arguments as described in the
date()
wrapper function, notablymutation_rate
, andpopulation_size
orpriors
. Further arguments includetime_units
,progress
, andrecord_provenance
. The additional argumentsreturn_posteriors
andreturn_likelihood
can be used to return additional information (see below).
- Returns:
ts (
TreeSequence
) – a copy of the input tree sequence with updated node times based on the posterior mean, corrected where necessary to ensure that parents are strictly older than all their children by an amount given by theeps
parameter.posteriors (
dict
) – (Only returned ifreturn_posteriors
isTrue
) A dictionary of posterior probabilities. Each node whose time was inferred corresponds to an item in this dictionary whose key is the node ID and value is an array of probabilities of the node being at a list of timepoints. Timepoint values are provided in the returned dictionary under the key named “time”. When read as a pandasDataFrame
object usingpd.DataFrame(posteriors)
, the rows correspond to labelled timepoints and columns are headed by their respective node ID.marginal_likelihood (
float
) – (Only returned ifreturn_likelihood
isTrue
) The marginal likelihood of the mutation data given the inferred node times.
- tsdate.maximization(tree_sequence, *, mutation_rate, population_size=None, priors=None, eps=None, num_threads=None, probability_space=None, **kwargs)[source]#
Infer dates for nodes in a genealogical graph using the “outside maximization” algorithm. This approximates the marginal posterior distribution of a node’s age using an atomic discretization of time (e.g. point masses at particular timepoints).
This estimation method comprises a single “inside” step followed by an “outside maximization” step. The inside step passes backwards in time from the samples to the roots of the graph,taking account of the distributions of times of each node’s child (and if a
mutation_rate
is given, the the number of mutations on each edge). The outside maximization step passes forwards in time from the roots, updating each node’s time on the basis of the most likely timepoint for each parent of that node. This provides a reasonable point estimate for node times, but does not generate a true posterior time distribution.For example:
new_ts = tsdate.maximization(ts, mutation_rate=1e-8, population_size=1e4)
Note
The prior parameters for each node-to-be-dated take the form of probabilities for each node at a set of discrete timepoints. If the
priors
parameter is used, it must specify an object constructed usingbuild_prior_grid()
(this can be used to define the number and position of the timepoints). Ifpriors
is not used,population_size
must be provided, which is used to create a default prior derived from the conditional coalescent (tilted according to population size and weighted by the genomic span over which a node has a given number of descendant samples). This default prior assumes the nodes to be dated are all the non-sample nodes in the input tree sequence, and that they are contemporaneous.- Parameters:
tree_sequence (TreeSequence) – The input tree sequence to be dated.
mutation_rate (float) – The estimated mutation rate per unit of genome per unit time. If provided, the dating algorithm will use a mutation rate clock to help estimate node dates. Default:
None
population_size (float or PopulationSizeHistory) – The estimated (diploid) effective population size used to construct the (default) conditional coalescent prior. For a population with constant size, this can be given as a single value (for example, as commonly estimated by the observed genetic diversity of the sample divided by four-times the expected mutation rate). Alternatively, for a population with time-varying size, this can be given directly as a
PopulationSizeHistory
object or a parameter dictionary passed to initialise aPopulationSizeHistory
object. Thepopulation_size
parameter is only used whenpriors
isNone
. Conversely, ifpriors
is notNone
, nopopulation_size
value should be specified.priors (tsdate.base.NodeGridValues) – NodeGridValues object containing the prior parameters for each node-to-be-dated. Note that different estimation methods may require different types of prior, as described in the documentation for each estimation method.
eps (float) – The error factor in time difference calculations, and the minimum distance separating parent and child ages in the returned tree sequence. Default: None, treated as 1e-6.
num_threads (int) – The number of threads to use when precalculating likelihoods. A simpler unthreaded algorithm is used unless this is >= 1. Default: None
probability_space (string) – Should the internal algorithm save probabilities in “logarithmic” (slower, less liable to to overflow) or “linear” space (fast, may overflow). Default: None treated as”logarithmic”
**kwargs – Other keyword arguments as described in the
date()
wrapper function, notablymutation_rate
, andpopulation_size
orpriors
. Further arguments includetime_units
,progress
, andrecord_provenance
. The additionalreturn_likelihood
argument can be used to return additional information (see below). Posteriors cannot be returned using this estimation method.
- Returns:
ts (
TreeSequence
) – a copy of the input tree sequence with updated node times based on the posterior mean, corrected where necessary to ensure that parents are strictly older than all their children by an amount given by theeps
parameter.marginal_likelihood (
float
) – (Only returned ifreturn_likelihood
isTrue
) The marginal likelihood of the mutation data given the inferred node times.
Prior and Time Discretisation Options#
- tsdate.build_prior_grid(tree_sequence, population_size, timepoints=20, *, approximate_priors=False, approx_prior_size=None, prior_distribution='lognorm', progress=False, allow_unary=False)#
Using the conditional coalescent, calculate the prior distribution for the age of each node, given the number of contemporaneous samples below it, and the discretised time slices at which to evaluate node age.
- Parameters:
tree_sequence (tskit.TreeSequence) – The input
tskit.TreeSequence
, treated as undated.population_size (float or demography.PopulationSizeHistory) – The estimated (diploid) effective population size used to construct the prior. For a population with constant size, this can be given as a single value. For a population with time-varying size, this can be given directly as a
PopulationSizeHistory
object or a parameter dictionary passed to initialise aPopulationSizeHistory
object. Using standard (unscaled) values forpopulation_size
results in a prior where times are measured in generations.timepoints (int or array_like) – The number of quantiles used to create the time slices, or manually-specified time slices as a numpy array. Default: 20
approximate_priors (bool) – Whether to use a precalculated approximation to the treewise conditional coalescent prior if there are large numbers of sample tips. If an approximate prior has not been precalculated, tsdate will do so and cache the result. Default: False
approx_prior_size (int) – Number of samples above which a precalculated prior is used. Only valid if
approximate_priors
is True. Default:None
, treated asDEFAULT_APPROX_PRIOR_SIZE
ifapproximate_priors
is True.prior_distr (string) – What distribution to use to approximate the conditional coalescent prior. Can be “lognorm” for the lognormal distribution (generally a better fit, but slightly slower to calculate) or “gamma” for the gamma distribution (slightly faster, but a poorer fit for recent nodes). Default: “lognorm”
- Returns:
A prior object to pass to
date()
and similar functions containing prior values for inference and a discretised time grid- Return type:
- tsdate.build_parameter_grid(tree_sequence, population_size, *, approximate_priors=False, approx_prior_size=None, progress=False, allow_unary=False)#
Using the conditional coalescent, calculate the prior distribution for the age of each node, given the number of contemporaneous samples below it, and return parameters (shape and rate of gamma) in a grid
- Parameters:
tree_sequence (tskit.TreeSequence) – The input tree sequence, treated as undated.
population_size (float) – The estimated (diploid) effective population size: must be specified. May be a single value, or a two-column array with epoch breakpoints and effective population sizes. Using standard (unscaled) values for
population_size
results in a prior where times are measured in generations.approximate_priors (bool) – Whether to use a precalculated approximation to the treewise conditional coalescent prior if there are large numbers of sample tips. If an approximate prior has not been precalculated, tsdate will do so and cache the result. Default: False
approx_prior_size (int) – Number of samples above which a precalculated prior is used. Only valid if
approximate_priors
is True. Default:None
, treated asDEFAULT_APPROX_PRIOR_SIZE
ifapproximate_priors
is True.
- Return type:
- class tsdate.base.NodeGridValues(num_nodes, nonfixed_nodes, timepoints, fill_value=nan, dtype=<class 'numpy.float64'>)[source]#
A class to store times or discretised distributions of times for node ids. For nodes with fixed times, only a single time value needs to be stored. For non-fixed nodes, an array of len(timepoints) probabilies is required.
Note
This class is not intended to be used directly by users and may be subject to change of name or internal structure in future versions. For details on how to create a
NodeGridValues
object to be used as a prior, see More on priors (old).- Variables:
num_nodes (int) – The number of nodes that will be stored in this object
nonfixed_nodes (numpy.ndarray) – a (possibly empty) numpy array of unique positive node ids each of which must be less than num_nodes. Each will have an array of grid_size associated with it. All others (up to num_nodes) will be associated with a single scalar value instead.
timepoints (numpy.ndarray) – Array of time points
fill_value (float) – What should we fill the data arrays with to start with
- tsdate.base.DEFAULT_APPROX_PRIOR_SIZE = 10000#
The default value for approx_prior_size (see
build_prior_grid()
andbuild_parameter_grid()
)
Variable population sizes#
Preprocessing Tree Sequences#
- tsdate.preprocess_ts(tree_sequence, *, minimum_gap=None, remove_telomeres=None, delete_intervals=None, split_disjoint=None, filter_populations=False, filter_individuals=False, filter_sites=False, record_provenance=None, **kwargs)[source]#
Function to prepare tree sequences for dating by modifying the tree sequence to increase the accuracy of dating. This can involve removing data-poor regions, removing locally-unary segments of nodes via simplification, and splitting discontinuous nodes.
- Parameters:
tree_sequence (tskit.TreeSequence) – The input tree sequence to be preprocessed.
minimum_gap (float) – The minimum gap between sites to remove from the tree sequence. Default:
None
treated as1000000
. Removed regions are recorded in the provenance of the resulting tree sequence.remove_telomeres (bool) – Should all material before the first site and after the last site be removed, regardless of the length. Default:
None
treated asTrue
delete_intervals (array_like) – A list (start, end) pairs describing the genomic intervals (gaps) to delete. This is usually left as
None
(the default) in which caseminimum_gap
andremove_telomeres
are used to determine the gaps to remove, and the calculated intervals are recorded in the provenance of the resulting tree sequence.split_disjoint (bool) – Run the {func}`split_disjoint_nodes` function on the returned tree sequence, breaking any disjoint node into nodes that can be dated separately (Default:
None
treated asTrue
).filter_populations (bool) – parameter passed to the {meth}`tskit.TreeSequence.simplify` command. Unlike calling that command directly, this defaults to
False
, such that all populations in the tree sequence are kept.filter_individuals (bool) – parameter passed to the {meth}`tskit.TreeSequence.simplify` command. Unlike calling that command directly, this defaults to
False
, such that all individuals in the tree sequence are kept.filter_sites (bool) – parameter passed to the {meth}`tskit.TreeSequence.simplify` command. Unlike calling that command directly, this defaults to
False
, such that all sites in the tree sequence are kept.record_provenance (bool) – If
True
, record details of this call to simplify in the returned tree sequence’s provenance information (Default:None
treated asTrue
).**kwargs – All further keyword arguments are passed to the {meth}`tskit.TreeSequence.simplify` command.
- Returns:
A tree sequence with gaps removed.
- Return type:
- tsdate.util.split_disjoint_nodes(ts, *, record_provenance=None)[source]#
For each non-sample node, split regions separated by gaps into distinct nodes, returning a tree sequence with potentially duplicated nodes.
Where there are multiple disconnected regions, the leftmost one is assigned the ID of the original node, and the remainder are assigned new node IDs. Population, flags, individual, time, and metadata are all copied into the new nodes. Nodes that have been split will be flagged with
tsdate.NODE_SPLIT_BY_PREPROCESS
. The metadata of these nodes will also be updated with an unsplit_node_id field giving the node ID in the input tree sequence to which they correspond. If this metadata cannot be set, a warning is emitted.- Parameters:
record_provenance (bool) – If
True
, record details of this call in the returned tree sequence’s provenance information (Default:None
treated asTrue
).
Functions for Inferring Tree Sequences with Historical Samples#
- tsdate.sites_time_from_ts(tree_sequence, *, unconstrained=True, node_selection='child', min_time=1)[source]#
Returns an estimated “time” for each site. This is the estimated age of the oldest MRCA which possesses a derived variant at that site, and is useful for performing (re)inference of a tree sequence. It is calculated from the ages of nodes, with the appropriate nodes identified by the position of mutations in the trees.
If node times in the tree sequence have been estimated by
tsdate
using the inside-outside algorithm, then as well as a time in the tree sequence, nodes will store additional time estimates that have not been explictly constrained by the tree topology. By default, this function tries to use these “unconstrained” times, although this is likely to fail (with a warning) on tree sequences that have not been processed bytsdate
: in this case the standard node times can be used by settingunconstrained=False
.The concept of a site time is meaningless for non-variable sites, and so the returned time for these sites is
np.nan
(note that this is not exactly the same as tskit.UNKNOWN_TIME, which marks sites that could have a meaningful time but whose time estimate is unknown).- Parameters:
tree_sequence (tskit.TreeSequence) – The input tree sequence.
unconstrained (bool) – Use estimated node times which have not been constrained by tree topology. If
True
(default), this requires a tree sequence which has been dated using thetsdate
inside-outside algorithm. If this is not the case, specifyFalse
to use the standard tree sequence node times.node_selection (str) –
Defines how site times are calculated from the age of the upper and lower nodes that bound each mutation at the site. Options are “child”, “parent”, “arithmetic” or “geometric”, with the following meanings
'child'
(default): the site time is the age of the oldest node below each mutation at the site'parent'
: the site time is the age of the oldest node above each mutation at the site'arithmetic'
: the arithmetic mean of the ages of the node above and the node below each mutation is calculated; the site time is the oldest of these means.'geometric'
: the geometric mean of the ages of the node above and the node below each mutation is calculated; the site time is the oldest of these means
min_time (float) – A site time of zero implies that no MRCA in the past possessed the derived variant, so the variant cannot be used for inferring relationships between the samples. To allow all variants to be potentially available for inference, if a site time would otherwise be calculated as zero (for example, where the
mutation_age
parameter is “child” or “geometric” and all mutations at a site are associated with leaf nodes), a minimum site greater than 0 is recommended. By default this is set to 1, which is generally reasonable for times measured in generations or years, although it is also fine to set this to a small epsilon value.
- Returns:
Array of length tree_sequence.num_sites with estimated time of each site
- Return type:
numpy.ndarray(dtype=np.float64)
- tsdate.add_sampledata_times(samples, sites_time)[source]#
Return a tsinfer.SampleData file with estimated times associated with sites. Ensures that each site’s time is at least as old as the oldest historic sample carrying a derived allele at that site.
- Parameters:
samples (tsinfer.formats.SampleData) – A tsinfer SampleData object to add site times to. Any historic individuals in this SampleData file are used to constrain site times.
- Returns:
A tsinfer.SampleData file
- Return type:
Constants#
- tsdate.NODE_IS_HISTORICAL_SAMPLE = 1048576#
Node flag value indicating that this is a non-contemporary sample node
- tsdate.NODE_SPLIT_BY_PREPROCESS = 1073741824#
Node flag value indicating that this was a disjoint node that was then split