Python API#
This page provides formal documentation for the tsdate Python API.
Running tsdate#
- tsdate.date(tree_sequence, *, mutation_rate, recombination_rate=None, time_units=None, method=None, constr_iterations=None, set_metadata=None, return_fit=None, return_likelihood=None, allow_unary=None, progress=None, record_provenance=True, **kwargs)[source]#
Infer dates for nodes in a genealogical graph (or ARG) stored in the succinct tree sequence format. New times are assigned to nodes using the estimation algorithm specified by
method
(see note below). Amutation_rate
must be given (the recombination_rate parameter, implementing a recombination clock, is unsupported at this time). Times associated with mutations and times associated with non-fixed (non-sample) nodes are overwritten. For example:mu = 1e-8 new_ts = tsdate.date(ts, mutation_rate=mu)
Note
This is a wrapper for the named functions that are listed in
estimation_methods
. Details and specific parameters for each estimation method are given in the documentation for those functions.- Parameters:
tree_sequence (TreeSequence) – The input tree sequence to be dated (for example one with
uncalibrated
node times).mutation_rate (float) – The estimated mutation rate per unit of genome per unit time (see individual methods)
recombination_rate (float) – The estimated recombination rate per unit of genome per unit time. If provided, the dating algorithm will use a recombination rate clock to help estimate node dates. Default:
None
(not currently implemented)time_units (str) – The time units used by the
mutation_rate
andrecombination_rate
values, and stored in thetime_units
attribute of the output tree sequence. If the conditional coalescent prior is used, then this is also applies to the value ofpopulation_size
, which in standard coalescent theory is measured in generations. Therefore if you wish to use mutation and recombination rates measured in (say) years, and are using the conditional coalescent prior, thepopulation_size
value which you provide must be scaled by multiplying by the number of years per generation. IfNone
(default), assume"generations"
.method (string) – What estimation method to use. See
estimation_methods
for possible values. IfNone
(default) the “variational_gamma” method is currently chosen.constr_iterations (int) – The maximum number of constrained least squares iterations to use prior to forcing positive branch lengths. Default: None, treated as 0.
set_metadata (bool) – Should unconstrained times be stored in table metadata, in the form of
"mn"
(mean) and"vr"
(variance) fields? IfFalse
, do not store metadata. IfTrue
, force metadata to be set (if no schema is set or the schema is incompatible, clear existing metadata in the relevant tables and set a new schema). IfNone
(default), only set metadata if the existing schema allows (this may overwrite existing"mn"
and"vr"
fields) or if existing metadata is empty, otherwise issue a warning.return_fit (bool) – If
True
, instead of just a dated tree sequence, return a tuple of(dated_ts, fit)
. Default: None, treated as False.return_likelihood (bool) – If
True
, return the log marginal likelihood from the inside algorithm in addition to the dated tree sequence. Ifreturn_fit
is alsoTrue
, then the marginal likelihood will be the last element of the tuple. Default: None, treated as False.allow_unary (bool) – Allow nodes that are “locally unary” (i.e. have only one child in one or more local trees). Default: None, treated as False.
progress (bool) – Show a progress bar. Default: None, treated as False.
record_provenance (bool) – Should the tsdate command be appended to the provenence information in the returned tree sequence? Default: None, treated as True.
**kwargs – Other keyword arguments specific to the
estimation method
used. These are documented in those specific functions.
- Returns:
A copy of the input tree sequence but with updated node times, or (if
return_fit
orreturn_likelihood
is True) a tuple of that tree sequence plus a fit object and/or the marginal likelihood given the mutations on the tree sequence.
- tsdate.core.estimation_methods#
The names of available estimation methods, each mapped to a function to carry out the appropriate method. Names can be passed as strings to the
date()
function, or each named function can be called directly:tsdate.variational_gamma()
: variational approximation, empirically most accurate.tsdate.inside_outside()
: empirically better, theoretically problematic.tsdate.maximization()
: worse empirically, especially with gamma approximated priors, but theoretically robust
- tsdate.variational_gamma(tree_sequence, *, mutation_rate, eps=None, max_iterations=None, rescaling_intervals=None, **kwargs)[source]#
Infer dates for nodes in a tree sequence using expectation propagation, which approximates the marginal posterior distribution of a given node’s age with a gamma distribution. Convergence to the correct posterior moments is obtained by updating the distributions for node dates using several rounds of iteration. For example:
new_ts = tsdate.variational_gamma(ts, mutation_rate=1e-8, max_iterations=10)
A piecewise-constant uniform distribution is used as a prior for each node, that is updated via expectation maximization in each iteration. Node-specific priors are not currently supported.
- Parameters:
tree_sequence (TreeSequence) – The input tree sequence to be dated.
mutation_rate (float) – The estimated mutation rate per unit of genome per unit time.
eps (float) – The minimum distance separating parent and child ages in the returned tree sequence. Default: None, treated as 1e-6
max_iterations (int) – The number of iterations used in the expectation propagation algorithm. Default: None, treated as 25.
rescaling_intervals (float) – For time rescaling, the number of time intervals within which to estimate a rescaling parameter. Setting this to zero means that rescaling is not performed. Default
None
, treated as 1000.rescaling_iterations (float) – The number of iterations for time rescaling. Setting this to zero means that rescaling is not performed. Default
None
, treated as 5.match_segregating_sites (bool) – If
True
, then time is rescaled such that branch- and site-mode segregating sites are approximately equal. IfFalse
, time is rescaled such that branch- and site-mode root-to-leaf length are approximately equal, which gives unbiased estimates when there are polytomies. DefaultFalse
.**kwargs – Other keyword arguments as described in the
date()
wrapper function, includingtime_units
,progress
,allow_unary
andrecord_provenance
. The argumentsreturn_fit
andreturn_likelihood
can be used to return additional information (see below).
- Returns:
ts (
TreeSequence
) – a copy of the input tree sequence with updated node times based on the posterior mean, corrected where necessary to ensure that parents are strictly older than all their children by an amount given by theeps
parameter.fit (
ExpectationPropagation
) – (Only returned ifreturn_fit
isTrue
). The underlying object used to run the dating inference. This can then be queried e.g. usingnode_posteriors()
marginal_likelihood (
float
) – (Only returned ifreturn_likelihood
isTrue
) The marginal likelihood of the mutation data given the inferred node times. Not currently implemented for this method (set toNone
)
- tsdate.inside_outside(tree_sequence, *, mutation_rate, population_size=None, priors=None, eps=None, num_threads=None, outside_standardize=None, ignore_oldest_root=None, probability_space=None, **kwargs)[source]#
Infer dates for nodes in a genealogical graph using the “inside outside” algorithm. This approximates the marginal posterior distribution of a node’s age using an atomic discretization of time (e.g. point masses at particular timepoints).
Currently, this estimation method comprises a single “inside” followed by a similar “outside” step. The inside step passes backwards in time from the samples to the roots of the graph,taking account of the distributions of times of each node’s child (and if a
mutation_rate
is given, the the number of mutations on each edge). The outside step passes forwards in time from the roots, incorporating the time distributions for each node’s parents. If there are (undirected) cycles in the underlying graph, this method does not provide a theoretically exact estimate of the marginal posterior distribution of node ages, but in practice it results in an accurate approximation.For example:
new_ts = tsdate.inside_outside(ts, mutation_rate=1e-8, population_size=1e4)
Note
The prior parameters for each node-to-be-dated take the form of probabilities for each node at a set of discrete timepoints. If the
priors
parameter is used, it must specify an object constructed usingbuild_prior_grid()
(this can be used to define the number and position of the timepoints). Ifpriors
is not used,population_size
must be provided, which is used to create a default prior derived from the conditional coalescent (tilted according to population size and weighted by the genomic span over which a node has a given number of descendant samples). This default prior assumes the nodes to be dated are all the non-sample nodes in the input tree sequence, and that they are contemporaneous.- Parameters:
tree_sequence (TreeSequence) – The input tree sequence to be dated.
mutation_rate (float) – The estimated mutation rate per unit of genome per unit time. If provided, the dating algorithm will use a mutation rate clock to help estimate node dates. Default:
None
population_size (float or PopulationSizeHistory) – The estimated (diploid) effective population size used to construct the (default) conditional coalescent prior. For a population with constant size, this can be given as a single value (for example, as commonly estimated by the observed genetic diversity of the sample divided by four-times the expected mutation rate). Alternatively, for a population with time-varying size, this can be given directly as a
PopulationSizeHistory
object or a parameter dictionary passed to initialise aPopulationSizeHistory
object. Thepopulation_size
parameter is only used whenpriors
isNone
. Conversely, ifpriors
is notNone
, nopopulation_size
value should be specified.priors (tsdate.node_time_class.NodeTimeValues) – NodeTimeValues object containing the prior parameters for each node-to-be-dated. Note that different estimation methods may require different types of prior, as described in the documentation for each estimation method.
eps (float) – The error factor in time difference calculations, and the minimum distance separating parent and child ages in the returned tree sequence. Default: None, treated as 1e-6.
num_threads (int) – The number of threads to use when precalculating likelihoods. A simpler unthreaded algorithm is used unless this is >= 1. Default: None
outside_standardize (bool) – Should the likelihoods be standardized during the outside step? This can help to avoid numerical under/overflow. Using unstandardized values is mostly useful for testing (e.g. to obtain, in the outside step, the total functional value for each node). Default: None, treated as True.
ignore_oldest_root (bool) – Should the oldest root in the tree sequence be ignored in the outside algorithm (if
"inside_outside"
is used as the method). Ignoring outside root can provide greater stability when dating tree sequences inferred from real data, in particular if all local trees are assumed to coalesce in a single “grand MRCA”, as in older versions oftsinfer
. Default: None, treated as False.probability_space (string) – Should the internal algorithm save probabilities in “logarithmic” (slower, less liable to to overflow) or “linear” space (fast, may overflow). Default: “logarithmic”
**kwargs – Other keyword arguments as described in the
date()
wrapper function, notablymutation_rate
, andpopulation_size
orpriors
. Further arguments includetime_units
,progress
,allow_unary
andrecord_provenance
. The additional argumentsreturn_fit
andreturn_likelihood
can be used to return additional information (see below).
- Returns:
ts (
TreeSequence
) – a copy of the input tree sequence with updated node times based on the posterior mean, corrected where necessary to ensure that parents are strictly older than all their children by an amount given by theeps
parameter.fit (
BeliefPropagation
) – (Only returned ifreturn_fit
isTrue
) The underlying object used to run the dating inference. This can then be queried e.g. usingnode_posteriors()
marginal_likelihood (
float
) – (Only returned ifreturn_likelihood
isTrue
) The marginal likelihood of the mutation data given the inferred node times.
- tsdate.maximization(tree_sequence, *, mutation_rate, population_size=None, priors=None, eps=None, num_threads=None, probability_space=None, **kwargs)[source]#
Infer dates for nodes in a genealogical graph using the “outside maximization” algorithm. This approximates the marginal posterior distribution of a node’s age using an atomic discretization of time (e.g. point masses at particular timepoints).
This estimation method comprises a single “inside” step followed by an “outside maximization” step. The inside step passes backwards in time from the samples to the roots of the graph,taking account of the distributions of times of each node’s child (and if a
mutation_rate
is given, the the number of mutations on each edge). The outside maximization step passes forwards in time from the roots, updating each node’s time on the basis of the most likely timepoint for each parent of that node. This provides a reasonable point estimate for node times, but does not generate a true posterior time distribution.For example:
new_ts = tsdate.maximization(ts, mutation_rate=1e-8, population_size=1e4)
Note
The prior parameters for each node-to-be-dated take the form of probabilities for each node at a set of discrete timepoints. If the
priors
parameter is used, it must specify an object constructed usingbuild_prior_grid()
(this can be used to define the number and position of the timepoints). Ifpriors
is not used,population_size
must be provided, which is used to create a default prior derived from the conditional coalescent (tilted according to population size and weighted by the genomic span over which a node has a given number of descendant samples). This default prior assumes the nodes to be dated are all the non-sample nodes in the input tree sequence, and that they are contemporaneous.- Parameters:
tree_sequence (TreeSequence) – The input tree sequence to be dated.
mutation_rate (float) – The estimated mutation rate per unit of genome per unit time. If provided, the dating algorithm will use a mutation rate clock to help estimate node dates. Default:
None
population_size (float or PopulationSizeHistory) – The estimated (diploid) effective population size used to construct the (default) conditional coalescent prior. For a population with constant size, this can be given as a single value (for example, as commonly estimated by the observed genetic diversity of the sample divided by four-times the expected mutation rate). Alternatively, for a population with time-varying size, this can be given directly as a
PopulationSizeHistory
object or a parameter dictionary passed to initialise aPopulationSizeHistory
object. Thepopulation_size
parameter is only used whenpriors
isNone
. Conversely, ifpriors
is notNone
, nopopulation_size
value should be specified.priors (tsdate.node_time_class.NodeTimeValues) – NodeTimeValues object containing the prior parameters for each node-to-be-dated. Note that different estimation methods may require different types of prior, as described in the documentation for each estimation method.
eps (float) – The error factor in time difference calculations, and the minimum distance separating parent and child ages in the returned tree sequence. Default: None, treated as 1e-6.
num_threads (int) – The number of threads to use when precalculating likelihoods. A simpler unthreaded algorithm is used unless this is >= 1. Default: None
probability_space (string) – Should the internal algorithm save probabilities in “logarithmic” (slower, less liable to to overflow) or “linear” space (fast, may overflow). Default: None treated as”logarithmic”
**kwargs – Other keyword arguments as described in the
date()
wrapper function, notablymutation_rate
, andpopulation_size
orpriors
. Further arguments includetime_units
,progress
,allow_unary
andrecord_provenance
. The additional argumentsreturn_fit
andreturn_likelihood
can be used to return additional information (see below).
- Returns:
ts (
TreeSequence
) – a copy of the input tree sequence with updated node times based on the posterior mean, corrected where necessary to ensure that parents are strictly older than all their children by an amount given by theeps
parameter.marginal_likelihood (
float
) – (Only returned ifreturn_likelihood
isTrue
) The marginal likelihood of the mutation data given the inferred node times.
Underlying fit objects#
Instances of the classes below are returned by setting return_fit=True
when dating. The fits can be inspected to obtain more detailed results than
may be present in the returned tree sequence and its metadata. The classes
are not intended to be instantiated directly.
- class tsdate.discrete.BeliefPropagation[source]#
The class that encapsulates running exact belief propagation models, in particular the discrete-time inside and outside algorithms.
- node_posteriors()[source]#
Return the distribution of posterior node times as a structured array. The returned value can be e.g. read into
pandas.DataFrame
for further analysis.Note
The
outside_maximization
method does not provide node time posteriors.- Returns:
The distribution of posterior node times as a structured array with columns as timepoints. Row
i
corresponds to the probabilities of nodei
lying at each timepoint. Nodes with fixed times are set tonp.nan
for the entire row.- Return type:
- class tsdate.variational.ExpectationPropagation[source]#
The class that encapsulates running the variational gamma approach to tsdate fitting. This contains the Expectation propagation (EP) algorithm to infer approximate marginal distributions for node ages.
The probability model has the form,
\[\prod_{i \in \mathcal{N}} f(t_i | \theta_i) \prod_{(i,j) \in \mathcal{E}} g(y_ij | t_i - t_j)\]where \(f(.)\) is a prior distribution on node ages with parameters \(\\theta\) and \(g(.)\) are Poisson likelihoods per edge. The EP approximation to the posterior has the form,
\[\prod_{i \in \mathcal{N}} q(t_i | \eta_i) \prod_{(i,j) \in \mathcal{E}} q(t_i | \gamma_{ij}) q(t_j | \kappa_{ij})\]where \(q(.)\) are pseudo-gamma distributions (termed ‘factors’), and \(\eta, \gamma, \kappa\) are variational parameters that reflect to prior, inside (leaf-to-root), and outside (root-to-edge) information.
Thus, the EP approximation results in gamma-distribution marginals. The factors \(q(.)\) do not need to be valid distributions (e.g. the shape/rate parameters may be negative), as long as the marginals are valid distributions. For details on how the variational parameters are optimized, see Minka (2002) “Expectation Propagation for Approximate Bayesian Inference”
- node_posteriors()[source]#
Return parameters specifying the inferred posterior distribution of node times which can be e.g. read into a
pandas.DataFrame
for further analysis. The mean times are not strictly constrained by topology, so unlike thenodes_time
attribute of a tree sequence, the mean time of a parent node may occasionally be less than that of one of its children.- Returns:
The distribution of posterior node times as a structured array of mean and variance. Row
i
gives the mean and variance of inferred node times for nodei
.- Return type:
- mutation_posteriors()[source]#
Returns parameters specifying the inferred posterior distribution of mutation times which can be e.g. read into a
pandas.DataFrame
for further analysis. These are calculated as the midpoint distribution of the posterior node time distributions of the node above and below the mutation. Note that this means it is possible for a mean mutation time not to lie between the mean values of its parent and child nodes.Note
For unphased singletons, the posterior mutation time is integrated over the two possible haploid genomes on which the singleton could be placed, accounting for the relative branch lengths above each genome.
- Returns:
The distribution of posterior mutation times as a structured array of mean and variance. Row
i
gives the mean and variance of inferred mutations times for mutationi
.- Return type:
Prior and Time Discretisation Options#
- tsdate.build_prior_grid(tree_sequence, population_size, timepoints=20, *, approximate_priors=False, approx_prior_size=None, prior_distribution='lognorm', progress=False, allow_unary=False)#
Using the conditional coalescent, calculate the prior distribution for the age of each node, given the number of contemporaneous samples below it, and the discretised time slices at which to evaluate node age.
- Parameters:
tree_sequence (tskit.TreeSequence) – The input
tskit.TreeSequence
, treated as undated.population_size (float or demography.PopulationSizeHistory) – The estimated (diploid) effective population size used to construct the prior. For a population with constant size, this can be given as a single value. For a population with time-varying size, this can be given directly as a
PopulationSizeHistory
object or a parameter dictionary passed to initialise aPopulationSizeHistory
object. Using standard (unscaled) values forpopulation_size
results in a prior where times are measured in generations.timepoints (int or array_like) – The number of quantiles used to create the time slices, or manually-specified time slices as a numpy array. Default: 20
approximate_priors (bool) – Whether to use a precalculated approximation to the treewise conditional coalescent prior if there are large numbers of sample tips. If an approximate prior has not been precalculated, tsdate will do so and cache the result. Default: False
approx_prior_size (int) – Number of samples above which a precalculated prior is used. Only valid if
approximate_priors
is True. Default:None
, treated asDEFAULT_APPROX_PRIOR_SIZE
ifapproximate_priors
is True.prior_distr (string) – What distribution to use to approximate the conditional coalescent prior. Can be “lognorm” for the lognormal distribution (generally a better fit, but slightly slower to calculate) or “gamma” for the gamma distribution (slightly faster, but a poorer fit for recent nodes). Default: “lognorm”
- Returns:
A prior object to pass to
date()
and similar functions containing prior values for inference and a discretised time grid- Return type:
- tsdate.build_parameter_grid(tree_sequence, population_size, *, approximate_priors=False, approx_prior_size=None, progress=False, allow_unary=False)#
Using the conditional coalescent, calculate the prior distribution for the age of each node, given the number of contemporaneous samples below it, and return parameters (shape and rate of gamma) in a grid
- Parameters:
tree_sequence (tskit.TreeSequence) – The input tree sequence, treated as undated.
population_size (float) – The estimated (diploid) effective population size: must be specified. May be a single value, or a two-column array with epoch breakpoints and effective population sizes. Using standard (unscaled) values for
population_size
results in a prior where times are measured in generations.approximate_priors (bool) – Whether to use a precalculated approximation to the treewise conditional coalescent prior if there are large numbers of sample tips. If an approximate prior has not been precalculated, tsdate will do so and cache the result. Default: False
approx_prior_size (int) – Number of samples above which a precalculated prior is used. Only valid if
approximate_priors
is True. Default:None
, treated asDEFAULT_APPROX_PRIOR_SIZE
ifapproximate_priors
is True.
- Return type:
- class tsdate.node_time_class.NodeTimeValues(num_nodes, nonfixed_nodes, timepoints, fill_value=nan, dtype=<class 'numpy.float64'>)[source]#
A class to store times or discretised distributions of times for node ids. For nodes with fixed times, only a single time value needs to be stored. For non-fixed nodes, an array of either len(timepoints) probabilties or a set of (gamma) distribution parameters is required.
Note
This class is not intended to be used directly by users and may be subject to change of name or internal structure in future versions. For details on how to create a
NodeTimeValues
object to be used as a prior, see More on priors (old).- Variables:
num_nodes (int) – The number of nodes that will be stored in this object
nonfixed_nodes (numpy.ndarray) – a (possibly empty) numpy array of unique positive node ids each of which must be less than num_nodes. Each will have an array of grid_size associated with it. All others (up to num_nodes) will be associated with a single scalar value instead.
timepoints (numpy.ndarray) – Array of time points
fill_value (float) – What should we fill the data arrays with to start with
- tsdate.prior.DEFAULT_APPROX_PRIOR_SIZE = 10000#
The default value for approx_prior_size (see
build_prior_grid()
andbuild_parameter_grid()
)
Variable population sizes#
Preprocessing Tree Sequences#
- tsdate.preprocess_ts(tree_sequence, *, minimum_gap=None, remove_telomeres=None, delete_intervals=None, split_disjoint=None, filter_populations=False, filter_individuals=False, filter_sites=False, record_provenance=None, **kwargs)[source]#
Function to prepare tree sequences for dating by modifying the tree sequence to increase the accuracy of dating. This can involve removing data-poor regions, removing locally-unary segments of nodes via simplification, and splitting discontinuous nodes.
- Parameters:
tree_sequence (tskit.TreeSequence) – The input tree sequence to be preprocessed.
minimum_gap (float) – The minimum gap between sites to remove from the tree sequence. Default:
None
treated as1000000
. Removed regions are recorded in the provenance of the resulting tree sequence.remove_telomeres (bool) – Should all material before the first site and after the last site be removed, regardless of the length. Default:
None
treated asTrue
delete_intervals (array_like) – A list (start, end) pairs describing the genomic intervals (gaps) to delete. This is usually left as
None
(the default) in which caseminimum_gap
andremove_telomeres
are used to determine the gaps to remove, and the calculated intervals are recorded in the provenance of the resulting tree sequence.split_disjoint (bool) – Run the {func}`split_disjoint_nodes` function on the returned tree sequence, breaking any disjoint node into nodes that can be dated separately (Default:
None
treated asTrue
).filter_populations (bool) – parameter passed to the {meth}`tskit.TreeSequence.simplify` command. Unlike calling that command directly, this defaults to
False
, such that all populations in the tree sequence are kept.filter_individuals (bool) – parameter passed to the {meth}`tskit.TreeSequence.simplify` command. Unlike calling that command directly, this defaults to
False
, such that all individuals in the tree sequence are kept.filter_sites (bool) – parameter passed to the {meth}`tskit.TreeSequence.simplify` command. Unlike calling that command directly, this defaults to
False
, such that all sites in the tree sequence are kept.record_provenance (bool) – If
True
, record details of this call to simplify in the returned tree sequence’s provenance information (Default:None
treated asTrue
).**kwargs – All further keyword arguments are passed to the {meth}`tskit.TreeSequence.simplify` command.
- Returns:
A tree sequence with gaps removed.
- Return type:
- tsdate.util.split_disjoint_nodes(ts, *, record_provenance=None)[source]#
For each non-sample node, split regions separated by gaps into distinct nodes, returning a tree sequence with potentially duplicated nodes.
Where there are multiple disconnected regions, the leftmost one is assigned the ID of the original node, and the remainder are assigned new node IDs. Population, flags, individual, time, and metadata are all copied into the new nodes. Nodes that have been split will be flagged with
tsdate.NODE_SPLIT_BY_PREPROCESS
. The metadata of these nodes will also be updated with an unsplit_node_id field giving the node ID in the input tree sequence to which they correspond. If this metadata cannot be set, a warning is emitted.- Parameters:
record_provenance (bool) – If
True
, record details of this call in the returned tree sequence’s provenance information (Default:None
treated asTrue
).
Functions for Inferring Tree Sequences with Historical Samples#
- tsdate.sites_time_from_ts(tree_sequence, *, unconstrained=True, node_selection='child', min_time=1)[source]#
Returns an estimated “time” for each site. This is the estimated age of the oldest MRCA which possesses a derived variant at that site, and is useful for performing (re)inference of a tree sequence. It is calculated from the ages of nodes, with the appropriate nodes identified by the position of mutations in the trees.
If node times in the tree sequence have been estimated by
tsdate
using the inside-outside algorithm, then as well as a time in the tree sequence, nodes will store additional time estimates that have not been explictly constrained by the tree topology. By default, this function tries to use these “unconstrained” times, although this is likely to fail (with a warning) on tree sequences that have not been processed bytsdate
: in this case the standard node times can be used by settingunconstrained=False
.The concept of a site time is meaningless for non-variable sites, and so the returned time for these sites is
np.nan
(note that this is not exactly the same as tskit.UNKNOWN_TIME, which marks sites that could have a meaningful time but whose time estimate is unknown).- Parameters:
tree_sequence (tskit.TreeSequence) – The input tree sequence.
unconstrained (bool) – Use estimated node times which have not been constrained by tree topology. If
True
(default), this requires a tree sequence which has been dated using thetsdate
inside-outside algorithm. If this is not the case, specifyFalse
to use the standard tree sequence node times.node_selection (str) –
Defines how site times are calculated from the age of the upper and lower nodes that bound each mutation at the site. Options are “child”, “parent”, “arithmetic” or “geometric”, with the following meanings
'child'
(default): the site time is the age of the oldest node below each mutation at the site'parent'
: the site time is the age of the oldest node above each mutation at the site'arithmetic'
: the arithmetic mean of the ages of the node above and the node below each mutation is calculated; the site time is the oldest of these means.'geometric'
: the geometric mean of the ages of the node above and the node below each mutation is calculated; the site time is the oldest of these means
min_time (float) – A site time of zero implies that no MRCA in the past possessed the derived variant, so the variant cannot be used for inferring relationships between the samples. To allow all variants to be potentially available for inference, if a site time would otherwise be calculated as zero (for example, where the
mutation_age
parameter is “child” or “geometric” and all mutations at a site are associated with leaf nodes), a minimum site greater than 0 is recommended. By default this is set to 1, which is generally reasonable for times measured in generations or years, although it is also fine to set this to a small epsilon value.
- Returns:
Array of length tree_sequence.num_sites with estimated time of each site
- Return type:
numpy.ndarray(dtype=np.float64)
- tsdate.add_sampledata_times(samples, sites_time)[source]#
Return a tsinfer.SampleData file with estimated times associated with sites. Ensures that each site’s time is at least as old as the oldest historic sample carrying a derived allele at that site.
- Parameters:
samples (tsinfer.formats.SampleData) – A tsinfer SampleData object to add site times to. Any historic individuals in this SampleData file are used to constrain site times.
- Returns:
A tsinfer.SampleData file
- Return type:
Constants#
- tsdate.NODE_IS_HISTORICAL_SAMPLE = 1048576#
Node flag value indicating that this is a non-contemporary sample node
- tsdate.NODE_SPLIT_BY_PREPROCESS = 1073741824#
Node flag value indicating that this was a disjoint node that was then split