Python API#

This page provides formal documentation for the tsdate Python API.

Running tsdate#

tsdate.date(tree_sequence, *, mutation_rate, recombination_rate=None, time_units=None, method=None, constr_iterations=None, set_metadata=None, return_fit=None, return_likelihood=None, allow_unary=None, progress=None, record_provenance=True, **kwargs)[source]#

Infer dates for nodes in a genealogical graph (or ARG) stored in the succinct tree sequence format. New times are assigned to nodes using the estimation algorithm specified by method (see note below). A mutation_rate must be given (the recombination_rate parameter, implementing a recombination clock, is unsupported at this time). Times associated with mutations and times associated with non-fixed (non-sample) nodes are overwritten. For example:

mu = 1e-8
new_ts = tsdate.date(ts, mutation_rate=mu)

Note

This is a wrapper for the named functions that are listed in estimation_methods. Details and specific parameters for each estimation method are given in the documentation for those functions.

Parameters:

tree_sequence (TreeSequence) – The input tree sequence to be dated (for example one with uncalibrated node times).
mutation_rate (float) – The estimated mutation rate per unit of genome per unit time (see individual methods)
recombination_rate (float) – The estimated recombination rate per unit of genome per unit time. If provided, the dating algorithm will use a recombination rate clock to help estimate node dates. Default: None (not currently implemented)
time_units (str) – The time units used by the mutation_rate and recombination_rate values, and stored in the time_units attribute of the output tree sequence. If the conditional coalescent prior is used, then this also applies to the value of population_size, which in standard coalescent theory is measured in generations. Therefore if you wish to use mutation and recombination rates measured in (say) years, and are using the conditional coalescent prior, the population_size value which you provide must be scaled by multiplying by the number of years per generation. If None (default), assume "generations".
method (string) – What estimation method to use. See estimation_methods for possible values. If None (default) the “variational_gamma” method is currently chosen.
constr_iterations (int) – The maximum number of constrained least squares iterations to use prior to forcing positive branch lengths. Default: None, treated as 0.
set_metadata (bool) – Should unconstrained times be stored in table metadata, in the form of "mn" (mean) and "vr" (variance) fields? If False, do not store metadata. If True, force metadata to be set (if no schema is set or the schema is incompatible, clear existing metadata in the relevant tables and set a new schema). If None (default), only set metadata if the existing schema allows (this may overwrite existing "mn" and "vr" fields) or if existing metadata is empty, otherwise issue a warning.
return_fit (bool) – If True, instead of just a dated tree sequence, return a tuple of (dated_ts, fit). Default: None, treated as False.
return_likelihood (bool) – If True, return the log marginal likelihood from the inside algorithm in addition to the dated tree sequence. If return_fit is also True, then the marginal likelihood will be the last element of the tuple. Default: None, treated as False.
allow_unary (bool) – Allow nodes that are “locally unary” (i.e. have only one child in one or more local trees). Default: None, treated as False.
progress (bool) – Show a progress bar. Default: None, treated as False.
record_provenance (bool) – Should the tsdate command be appended to the provenance information in the returned tree sequence? Default: None, treated as True.
**kwargs – Other keyword arguments specific to the estimation method used. These are documented in those specific functions.
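The time_units scaling described above can be sketched as plain arithmetic. The generation time, mutation rate, and population size below are hypothetical values chosen only for illustration, and population_size only matters for methods that use the conditional coalescent prior.

```python
# Hypothetical inputs, chosen only for illustration.
generation_time = 25.0        # years per generation (assumed)
mu_per_generation = 1.25e-8   # mutations per unit of genome per generation (assumed)
ne_generations = 10_000       # diploid effective population size (assumed)

# Rates are "per unit time", so a per-generation rate is divided by the
# number of years per generation to obtain a per-year rate.
mu_per_year = mu_per_generation / generation_time

# The population size is multiplied by the number of years per generation.
ne_years = ne_generations * generation_time

# Keyword arguments that could then be passed to a dating function.
date_kwargs = dict(
    mutation_rate=mu_per_year,
    population_size=ne_years,
    time_units="years",
)
print(date_kwargs)
```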

Returns:

A copy of the input tree sequence but with updated node times, or (if return_fit or return_likelihood is True) a tuple of that tree sequence plus a fit object and/or the marginal likelihood given the mutations on the tree sequence.
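Because the shape of the return value depends on return_fit and return_likelihood, calling code may want to normalise it. The helper below is a hypothetical convenience function (not part of tsdate) that relies only on the ordering stated above: the tree sequence comes first, and the marginal likelihood, if requested, is the last element.

```python
def unpack_date_result(result, return_fit=False, return_likelihood=False):
    """Normalise a dating result into a (ts, fit, likelihood) triple.

    Hypothetical helper: entries that were not requested are None.
    """
    if not (return_fit or return_likelihood):
        # A bare tree sequence is returned when no extras are requested.
        return result, None, None
    items = list(result)
    ts = items[0]
    fit = items[1] if return_fit else None
    likelihood = items[-1] if return_likelihood else None
    return ts, fit, likelihood


# Placeholder strings stand in for real tree sequence and fit objects.
print(unpack_date_result(("dated_ts", "fit", -123.4), True, True))
```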

tsdate.core.estimation_methods#

The names of available estimation methods, each mapped to a function to carry out the appropriate method. Names can be passed as strings to the date() function, or each named function can be called directly:

tsdate.variational_gamma(): variational approximation, empirically most accurate.
tsdate.inside_outside(): empirically better, theoretically problematic.
tsdate.maximization(): worse empirically, especially with gamma approximated priors, but theoretically robust

tsdate.variational_gamma(tree_sequence, *, mutation_rate, eps=None, max_iterations=None, rescaling_intervals=None, rescaling_iterations=None, match_segregating_sites=None, **kwargs)[source]#

Infer dates for nodes in a tree sequence using expectation propagation, which approximates the marginal posterior distribution of a given node’s age with a gamma distribution. Convergence to the correct posterior moments is obtained by updating the distributions for node dates using several rounds of iteration. For example:

new_ts = tsdate.variational_gamma(ts, mutation_rate=1e-8, max_iterations=10)

A piecewise-constant uniform distribution is used as a prior for each node, and is updated via expectation maximization in each iteration. Node-specific priors are not currently supported.

Parameters:

tree_sequence (TreeSequence) – The input tree sequence to be dated.
mutation_rate (float) – The estimated mutation rate per unit of genome per unit time.
eps (float) – The minimum distance separating parent and child ages in the returned tree sequence. Default: None, treated as 1e-10
max_iterations (int) – The number of iterations used in the expectation propagation algorithm. Default: None, treated as 25.
rescaling_intervals (float) – For time rescaling, the number of time intervals within which to estimate a rescaling parameter. Setting this to zero means that rescaling is not performed. Default None, treated as 1000.
rescaling_iterations (float) – The number of iterations for time rescaling. Setting this to zero means that rescaling is not performed. Default None, treated as 5.
match_segregating_sites (bool) – If True, then time is rescaled such that branch- and site-mode segregating sites are approximately equal. If False, time is rescaled such that branch- and site-mode root-to-leaf length are approximately equal, which gives unbiased estimates when there are polytomies. Default False.
**kwargs – Other keyword arguments as described in the date() wrapper function, including time_units, progress, allow_unary and record_provenance. The arguments return_fit and return_likelihood can be used to return additional information (see below).

Returns:

ts (TreeSequence) – a copy of the input tree sequence with updated node times based on the posterior mean, corrected where necessary to ensure that parents are strictly older than all their children by an amount given by the eps parameter.
fit (ExpectationPropagation) – (Only returned if return_fit is True). The underlying object used to run the dating inference. This can then be queried e.g. using node_posteriors()
marginal_likelihood (float) – (Only returned if return_likelihood is True) The marginal likelihood of the mutation data given the inferred node times. Not currently implemented for this method (set to None)
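The eps correction mentioned for the returned tree sequence can be pictured with a small sketch. This is not tsdate's implementation, only an illustration of the stated guarantee; the function name and the (parent, child) edge representation are assumptions, and the graph is assumed to be acyclic.

```python
def constrain_times(times, edges, eps=1e-10):
    """Raise parent times until each parent is at least eps older than its children.

    times: unconstrained time per node, indexed by node ID.
    edges: iterable of (parent, child) node ID pairs from an acyclic graph.
    """
    times = list(times)
    edges = list(edges)
    changed = True
    # Repeat until a full pass over the edges makes no further change.
    while changed:
        changed = False
        for parent, child in edges:
            minimum = times[child] + eps
            if times[parent] < minimum:
                times[parent] = minimum
                changed = True
    return times


# Node 3 is the parent of node 2, but its unconstrained mean is younger.
print(constrain_times([0, 0, 5, 3], [(2, 0), (2, 1), (3, 2)], eps=1))
```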

tsdate.inside_outside(tree_sequence, *, mutation_rate, population_size=None, priors=None, eps=None, num_threads=None, outside_standardize=None, ignore_oldest_root=None, probability_space=None, **kwargs)[source]#

Infer dates for nodes in a genealogical graph using the “inside outside” algorithm. This approximates the marginal posterior distribution of a node’s age using an atomic discretization of time (e.g. point masses at particular timepoints).

Currently, this estimation method comprises a single “inside” step followed by a similar “outside” step. The inside step passes backwards in time from the samples to the roots of the graph, taking account of the distributions of times of each node’s children (and if a mutation_rate is given, the number of mutations on each edge). The outside step passes forwards in time from the roots, incorporating the time distributions for each node’s parents. If there are (undirected) cycles in the underlying graph, this method does not provide a theoretically exact estimate of the marginal posterior distribution of node ages, but in practice it results in an accurate approximation.

For example:

new_ts = tsdate.inside_outside(ts, mutation_rate=1e-8, population_size=1e4)

Note

The prior parameters for each node-to-be-dated take the form of probabilities for each node at a set of discrete timepoints. If the priors parameter is used, it must specify an object constructed using build_prior_grid() (this can be used to define the number and position of the timepoints). If priors is not used, population_size must be provided, which is used to create a default prior derived from the conditional coalescent (tilted according to population size and weighted by the genomic span over which a node has a given number of descendant samples). This default prior assumes the nodes to be dated are all the non-sample nodes in the input tree sequence, and that they are contemporaneous.

Parameters:

tree_sequence (TreeSequence) – The input tree sequence to be dated.
mutation_rate (float) – The estimated mutation rate per unit of genome per unit time. If provided, the dating algorithm will use a mutation rate clock to help estimate node dates. Default: None
population_size (float or PopulationSizeHistory) – The estimated (diploid) effective population size used to construct the (default) conditional coalescent prior. For a population with constant size, this can be given as a single value (for example, as commonly estimated by the observed genetic diversity of the sample divided by four-times the expected mutation rate). Alternatively, for a population with time-varying size, this can be given directly as a PopulationSizeHistory object or a parameter dictionary passed to initialise a PopulationSizeHistory object. The population_size parameter is only used when priors is None. Conversely, if priors is not None, no population_size value should be specified.
priors (tsdate.node_time_class.NodeTimeValues) – NodeTimeValues object containing the prior parameters for each node-to-be-dated. Note that different estimation methods may require different types of prior, as described in the documentation for each estimation method.
eps (float) – The error factor in time difference calculations, and the minimum distance separating parent and child ages in the returned tree sequence. Default: None, treated as 1e-10.
num_threads (int) – The number of threads to use when precalculating likelihoods. A simpler unthreaded algorithm is used unless this is >= 1. Default: None
outside_standardize (bool) – Should the likelihoods be standardized during the outside step? This can help to avoid numerical under/overflow. Using unstandardized values is mostly useful for testing (e.g. to obtain, in the outside step, the total functional value for each node). Default: None, treated as True.
ignore_oldest_root (bool) – Should the oldest root in the tree sequence be ignored in the outside algorithm? Ignoring the oldest root can provide greater stability when dating tree sequences inferred from real data, in particular if all local trees are assumed to coalesce in a single “grand MRCA”, as in older versions of tsinfer. Default: None, treated as False.
probability_space (string) – Should the internal algorithm save probabilities in “logarithmic” (slower, less liable to overflow) or “linear” space (fast, may overflow). Default: “logarithmic”
**kwargs – Other keyword arguments as described in the date() wrapper function, notably mutation_rate, and population_size or priors. Further arguments include time_units, progress, allow_unary and record_provenance. The additional arguments return_fit and return_likelihood can be used to return additional information (see below).
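The rule of thumb given for a constant population_size (observed genetic diversity divided by four times the mutation rate) amounts to the following arithmetic. The function name and the input values are hypothetical.

```python
def estimate_population_size(diversity, mutation_rate):
    """Diploid effective population size from diversity / (4 * mutation rate)."""
    return diversity / (4 * mutation_rate)


# Assumed values: pairwise diversity of 4e-4 and a mutation rate of 1e-8.
ne = estimate_population_size(diversity=4e-4, mutation_rate=1e-8)
print(round(ne))
```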

Returns:

ts (TreeSequence) – a copy of the input tree sequence with updated node times based on the posterior mean, corrected where necessary to ensure that parents are strictly older than all their children by an amount given by the eps parameter.
fit (BeliefPropagation) – (Only returned if return_fit is True) The underlying object used to run the dating inference. This can then be queried e.g. using node_posteriors()
marginal_likelihood (float) – (Only returned if return_likelihood is True) The marginal likelihood of the mutation data given the inferred node times.

tsdate.maximization(tree_sequence, *, mutation_rate, population_size=None, priors=None, eps=None, num_threads=None, probability_space=None, **kwargs)[source]#

Infer dates for nodes in a genealogical graph using the “outside maximization” algorithm. This approximates the marginal posterior distribution of a node’s age using an atomic discretization of time (e.g. point masses at particular timepoints).

This estimation method comprises a single “inside” step followed by an “outside maximization” step. The inside step passes backwards in time from the samples to the roots of the graph, taking account of the distributions of times of each node’s children (and if a mutation_rate is given, the number of mutations on each edge). The outside maximization step passes forwards in time from the roots, updating each node’s time on the basis of the most likely timepoint for each parent of that node. This provides a reasonable point estimate for node times, but does not generate a true posterior time distribution.

For example:

new_ts = tsdate.maximization(ts, mutation_rate=1e-8, population_size=1e4)

Note

The prior parameters for each node-to-be-dated take the form of probabilities for each node at a set of discrete timepoints. If the priors parameter is used, it must specify an object constructed using build_prior_grid() (this can be used to define the number and position of the timepoints). If priors is not used, population_size must be provided, which is used to create a default prior derived from the conditional coalescent (tilted according to population size and weighted by the genomic span over which a node has a given number of descendant samples). This default prior assumes the nodes to be dated are all the non-sample nodes in the input tree sequence, and that they are contemporaneous.

Parameters:

tree_sequence (TreeSequence) – The input tree sequence to be dated.
mutation_rate (float) – The estimated mutation rate per unit of genome per unit time. If provided, the dating algorithm will use a mutation rate clock to help estimate node dates. Default: None
population_size (float or PopulationSizeHistory) – The estimated (diploid) effective population size used to construct the (default) conditional coalescent prior. For a population with constant size, this can be given as a single value (for example, as commonly estimated by the observed genetic diversity of the sample divided by four-times the expected mutation rate). Alternatively, for a population with time-varying size, this can be given directly as a PopulationSizeHistory object or a parameter dictionary passed to initialise a PopulationSizeHistory object. The population_size parameter is only used when priors is None. Conversely, if priors is not None, no population_size value should be specified.
priors (tsdate.node_time_class.NodeTimeValues) – NodeTimeValues object containing the prior parameters for each node-to-be-dated. Note that different estimation methods may require different types of prior, as described in the documentation for each estimation method.
eps (float) – The error factor in time difference calculations, and the minimum distance separating parent and child ages in the returned tree sequence. Default: None, treated as 1e-10.
num_threads (int) – The number of threads to use when precalculating likelihoods. A simpler unthreaded algorithm is used unless this is >= 1. Default: None
probability_space (string) – Should the internal algorithm save probabilities in “logarithmic” (slower, less liable to overflow) or “linear” space (fast, may overflow). Default: None, treated as “logarithmic”
**kwargs – Other keyword arguments as described in the date() wrapper function, notably mutation_rate, and population_size or priors. Further arguments include time_units, progress, allow_unary and record_provenance. The additional arguments return_fit and return_likelihood can be used to return additional information (see below).

Returns:

ts (TreeSequence) – a copy of the input tree sequence with updated node times based on the posterior mean, corrected where necessary to ensure that parents are strictly older than all their children by an amount given by the eps parameter.
marginal_likelihood (float) – (Only returned if return_likelihood is True) The marginal likelihood of the mutation data given the inferred node times.

Underlying fit objects#

Instances of the classes below are returned by setting return_fit=True when dating. The fits can be inspected to obtain more detailed results than may be present in the returned tree sequence and its metadata. The classes are not intended to be instantiated directly.

class tsdate.discrete.BeliefPropagation[source]#

The class that encapsulates running exact belief propagation models, in particular the discrete-time inside and outside algorithms.

node_posteriors()[source]#

Return the distribution of posterior node times as a structured array. The returned value can be e.g. read into pandas.DataFrame for further analysis.

Note

The outside_maximization method does not provide node time posteriors.

Returns: The distribution of posterior node times as a structured array with columns as timepoints. Row i corresponds to the probabilities of node i lying at each timepoint. Nodes with fixed times are set to np.nan for the entire row.
Return type: numpy.ndarray
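A row of discrete posterior probabilities can be summarised by its mean and variance. The sketch below works on plain Python sequences rather than the structured array itself; the function name is hypothetical, and it assumes the timepoint values are available alongside the probabilities.

```python
import math


def posterior_mean_and_variance(timepoints, probabilities):
    """Mean and variance of a discrete distribution over timepoints.

    Returns (nan, nan) for rows filled with nan, as used for fixed nodes.
    """
    if any(math.isnan(p) for p in probabilities):
        return math.nan, math.nan
    total = sum(probabilities)  # renormalise in case the row does not sum to 1
    mean = sum(t * p for t, p in zip(timepoints, probabilities)) / total
    variance = sum(p * (t - mean) ** 2 for t, p in zip(timepoints, probabilities)) / total
    return mean, variance


print(posterior_mean_and_variance([0, 10, 20], [0.25, 0.5, 0.25]))
```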

class tsdate.variational.ExpectationPropagation[source]#

The class that encapsulates running the variational gamma approach to tsdate fitting. This contains the Expectation propagation (EP) algorithm to infer approximate marginal distributions for node ages.

The probability model has the form,

\[\prod_{i \in \mathcal{N}} f(t_i | \theta_i) \prod_{(i,j) \in \mathcal{E}} g(y_{ij} | t_i - t_j)\]

where \(f(.)\) is a prior distribution on node ages with parameters \(\theta\) and \(g(.)\) are Poisson likelihoods per edge. The EP approximation to the posterior has the form,

\[\prod_{i \in \mathcal{N}} q(t_i | \eta_i) \prod_{(i,j) \in \mathcal{E}} q(t_i | \gamma_{ij}) q(t_j | \kappa_{ij})\]

where \(q(.)\) are pseudo-gamma distributions (termed ‘factors’), and \(\eta, \gamma, \kappa\) are variational parameters that reflect prior, inside (leaf-to-root), and outside (root-to-leaf) information.

Thus, the EP approximation results in gamma-distribution marginals. The factors \(q(.)\) do not need to be valid distributions (e.g. the shape/rate parameters may be negative), as long as the marginals are valid distributions. For details on how the variational parameters are optimized, see Minka (2002) “Expectation Propagation for Approximate Bayesian Inference”
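To see why invalid factors can still give a valid marginal, note that multiplying terms of the form t^(shape - 1) exp(-rate t) adds their exponents. The sketch below uses a shape/rate parameterisation and hypothetical function names; it illustrates the algebra only and is not tsdate's internal representation.

```python
def combine_factors(factors):
    """Multiply pseudo-gamma factors proportional to t**(shape - 1) * exp(-rate * t).

    factors: iterable of (shape, rate) pairs, which may individually be invalid.
    Returns the (shape, rate) of the resulting gamma marginal.
    """
    factors = list(factors)
    shape = 1 + sum(s - 1 for s, _ in factors)
    rate = sum(r for _, r in factors)
    if shape <= 0 or rate <= 0:
        raise ValueError("the product of factors is not a valid gamma distribution")
    return shape, rate


def gamma_moments(shape, rate):
    """Mean and variance of a gamma distribution with the given shape and rate."""
    return shape / rate, shape / rate**2


# The middle factor has a negative rate, yet the product is a valid gamma.
shape, rate = combine_factors([(3.0, 0.5), (0.5, -0.25), (2.5, 0.25)])
print(shape, rate, gamma_moments(shape, rate))
```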

node_posteriors()[source]#

Return parameters specifying the inferred posterior distribution of node times which can be e.g. read into a pandas.DataFrame for further analysis. The mean times are not strictly constrained by topology, so unlike the nodes_time attribute of a tree sequence, the mean time of a parent node may occasionally be less than that of one of its children.

Returns: The distribution of posterior node times as a structured array of mean and variance. Row i gives the mean and variance of inferred node times for node i.
Return type: numpy.ndarray

mutation_posteriors()[source]#

Returns parameters specifying the inferred posterior distribution of mutation times which can be e.g. read into a pandas.DataFrame for further analysis. These are calculated as the midpoint distribution of the posterior node time distributions of the node above and below the mutation. Note that this means it is possible for a mean mutation time not to lie between the mean values of its parent and child nodes.

Note

For unphased singletons, the posterior mutation time is integrated over the two possible haploid genomes on which the singleton could be placed, accounting for the relative branch lengths above each genome.

Returns: The distribution of posterior mutation times as a structured array of mean and variance. Row i gives the mean and variance of inferred mutation times for mutation i.
Return type: numpy.ndarray

Prior and Time Discretisation Options#

tsdate.build_prior_grid(tree_sequence, population_size, timepoints=20, *, approximate_priors=False, approx_prior_size=None, prior_distribution='lognorm', progress=False, allow_unary=False)#

Using the conditional coalescent, calculate the prior distribution for the age of each node, given the number of contemporaneous samples below it, and the discretised time slices at which to evaluate node age.

Parameters:

tree_sequence (tskit.TreeSequence) – The input tskit.TreeSequence, treated as undated.
population_size (float or demography.PopulationSizeHistory) – The estimated (diploid) effective population size used to construct the prior. For a population with constant size, this can be given as a single value. For a population with time-varying size, this can be given directly as a PopulationSizeHistory object or a parameter dictionary passed to initialise a PopulationSizeHistory object. Using standard (unscaled) values for population_size results in a prior where times are measured in generations.
timepoints (int or array_like) – The number of quantiles used to create the time slices, or manually-specified time slices as a numpy array. Default: 20
approximate_priors (bool) – Whether to use a precalculated approximation to the treewise conditional coalescent prior if there are large numbers of sample tips. If an approximate prior has not been precalculated, tsdate will do so and cache the result. Default: False
approx_prior_size (int) – Number of samples above which a precalculated prior is used. Only valid if approximate_priors is True. Default: None, treated as DEFAULT_APPROX_PRIOR_SIZE if approximate_priors is True.
prior_distribution (string) – What distribution to use to approximate the conditional coalescent prior. Can be “lognorm” for the lognormal distribution (generally a better fit, but slightly slower to calculate) or “gamma” for the gamma distribution (slightly faster, but a poorer fit for recent nodes). Default: “lognorm”

Returns:

A prior object to pass to date() and similar functions containing prior values for inference and a discretised time grid

Return type:

node_time_class.NodeTimeValues

tsdate.build_parameter_grid(tree_sequence, population_size, *, approximate_priors=False, approx_prior_size=None, progress=False, allow_unary=False)#

Using the conditional coalescent, calculate the prior distribution for the age of each node, given the number of contemporaneous samples below it, and return parameters (shape and rate of gamma) in a grid

Parameters:

tree_sequence (tskit.TreeSequence) – The input tree sequence, treated as undated.
population_size (float) – The estimated (diploid) effective population size: must be specified. May be a single value, or a two-column array with epoch breakpoints and effective population sizes. Using standard (unscaled) values for population_size results in a prior where times are measured in generations.
approximate_priors (bool) – Whether to use a precalculated approximation to the treewise conditional coalescent prior if there are large numbers of sample tips. If an approximate prior has not been precalculated, tsdate will do so and cache the result. Default: False
approx_prior_size (int) – Number of samples above which a precalculated prior is used. Only valid if approximate_priors is True. Default: None, treated as DEFAULT_APPROX_PRIOR_SIZE if approximate_priors is True.

Return type:

node_time_class.NodeTimeValues

class tsdate.node_time_class.NodeTimeValues(num_nodes, nonfixed_nodes, timepoints, fill_value=nan, dtype=<class 'numpy.float64'>)[source]#

A class to store times or discretised distributions of times for node ids. For nodes with fixed times, only a single time value needs to be stored. For non-fixed nodes, an array of either len(timepoints) probabilities or a set of (gamma) distribution parameters is required.

Note

This class is not intended to be used directly by users and may be subject to change of name or internal structure in future versions. For details on how to create a NodeTimeValues object to be used as a prior, see More on priors (old).

Variables:

num_nodes (int) – The number of nodes that will be stored in this object
nonfixed_nodes (numpy.ndarray) – a (possibly empty) numpy array of unique positive node ids each of which must be less than num_nodes. Each will have an array of grid_size associated with it. All others (up to num_nodes) will be associated with a single scalar value instead.
timepoints (numpy.ndarray) – Array of time points
fill_value (float) – What should we fill the data arrays with to start with

tsdate.prior.DEFAULT_APPROX_PRIOR_SIZE = 10000#: The default value for approx_prior_size (see build_prior_grid() and build_parameter_grid())

Variable population sizes#

class tsdate.demography.PopulationSizeHistory(population_size, time_breaks=None)[source]#: Stores a piecewise constant population size history and transforms time from a natural (generational) scale to a coalescent one.
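The idea of rescaling time under a piecewise constant history can be sketched without the class itself. The function below is hypothetical and makes two assumptions that are not stated on this page: that population_size has one more entry than time_breaks (with the first epoch starting at time 0), and that coalescent time accumulates at a rate of 1 / (2N) per generation.

```python
def to_coalescent_time(t, population_size, time_breaks):
    """Integrate 1 / (2 * N) from 0 to t for a piecewise-constant N.

    population_size: size in each epoch (assumed len(time_breaks) + 1 entries).
    time_breaks: start times of the second and later epochs, in generations.
    """
    starts = [0.0] + list(time_breaks)
    ends = list(time_breaks) + [float("inf")]
    total = 0.0
    for size, start, end in zip(population_size, starts, ends):
        if t <= start:
            break
        # Add the part of this epoch that lies below time t.
        total += (min(t, end) - start) / (2 * size)
    return total


# 50 generations at N=100, then 100 generations at N=1000.
print(to_coalescent_time(150, [100, 1000], [50]))
```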

Preprocessing Tree Sequences#

tsdate.preprocess_ts(tree_sequence, *, minimum_gap=None, erase_flanks=None, delete_intervals=None, split_disjoint=None, filter_populations=False, filter_individuals=False, filter_sites=False, record_provenance=None, remove_telomeres=None, **kwargs)[source]#

Function to prepare tree sequences for dating by modifying the tree sequence to increase the accuracy of dating. This can involve removing data-poor regions, removing locally-unary segments of nodes via simplification, and splitting discontinuous nodes.

Parameters:

tree_sequence (tskit.TreeSequence) – The input tree sequence to be preprocessed.
minimum_gap (float) – The minimum gap between sites to remove from the tree sequence. Default: None treated as 1000000. Removed regions are recorded in the provenance of the resulting tree sequence.
erase_flanks (bool) – Should all material before the first site and after the last site be removed, regardless of the length. Default: None treated as True
delete_intervals (array_like) – A list of (start, end) pairs describing the genomic intervals (gaps) to delete. This is usually left as None (the default) in which case minimum_gap and erase_flanks are used to determine the gaps to remove, and the calculated intervals are recorded in the provenance of the resulting tree sequence.
split_disjoint (bool) – Run the split_disjoint_nodes() function on the returned tree sequence, breaking any disjoint node into nodes that can be dated separately (Default: None treated as True).
filter_populations (bool) – parameter passed to the tskit.TreeSequence.simplify() command. Unlike calling that command directly, this defaults to False, such that all populations in the tree sequence are kept.
filter_individuals (bool) – parameter passed to the tskit.TreeSequence.simplify() command. Unlike calling that command directly, this defaults to False, such that all individuals in the tree sequence are kept.
filter_sites (bool) – parameter passed to the tskit.TreeSequence.simplify() command. Unlike calling that command directly, this defaults to False, such that all sites in the tree sequence are kept.
record_provenance (bool) – If True, record details of this call to simplify in the returned tree sequence’s provenance information (Default: None treated as True).
remove_telomeres (bool) – Deprecated alias for erase_flanks.
**kwargs – All further keyword arguments are passed to the tskit.TreeSequence.simplify() command.
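How minimum_gap and erase_flanks might translate into a list of intervals to delete can be sketched as follows. This is an illustration of the idea rather than tsdate's code: the function name is hypothetical, each site is assumed to occupy one unit of genome, and a gap is assumed to qualify only when it is strictly larger than minimum_gap.

```python
def gaps_to_delete(site_positions, sequence_length, minimum_gap=1_000_000, erase_flanks=True):
    """Return (start, end) intervals that contain no sites and could be removed."""
    positions = sorted(site_positions)
    intervals = []
    if not positions:
        return intervals
    # Left flank: everything before the first site.
    if erase_flanks and positions[0] > 0:
        intervals.append((0, positions[0]))
    # Internal gaps between consecutive sites.
    for left, right in zip(positions, positions[1:]):
        if right - left > minimum_gap:
            intervals.append((left + 1, right))
    # Right flank: everything after the last site.
    if erase_flanks and positions[-1] + 1 < sequence_length:
        intervals.append((positions[-1] + 1, sequence_length))
    return intervals


print(gaps_to_delete([100, 200, 5000], sequence_length=6000, minimum_gap=1000))
```

The resulting list has the same (start, end) shape as the delete_intervals parameter described above.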

Returns:

A tree sequence with gaps removed.

Return type:

tskit.TreeSequence

tsdate.util.split_disjoint_nodes(ts, *, record_provenance=None)[source]#

For each non-sample node, split regions separated by gaps into distinct nodes, returning a tree sequence with potentially duplicated nodes.

Where there are multiple disconnected regions, the leftmost one is assigned the ID of the original node, and the remainder are assigned new node IDs. Population, flags, individual, time, and metadata are all copied into the new nodes. Nodes that have been split will be flagged with tsdate.NODE_SPLIT_BY_PREPROCESS. The metadata of these nodes will also be updated with an unsplit_node_id field giving the node ID in the input tree sequence to which they correspond. If this metadata cannot be set, a warning is emitted.

Parameters: record_provenance (bool) – If True, record details of this call in the returned tree sequence’s provenance information (Default: None treated as True).
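The ID assignment rule (the leftmost region keeps the original ID, later regions get new IDs) can be illustrated on plain interval lists. The function and its dictionary-based input and output are hypothetical simplifications; the real function operates on a tree sequence and also copies node properties and sets flags.

```python
def split_disjoint(node_spans, num_nodes):
    """Assign node IDs to the contiguous regions occupied by each node.

    node_spans: {node_id: [(left, right), ...]} genomic intervals per node.
    num_nodes: number of nodes in the input, so new IDs start from here.
    """
    result = {}
    next_id = num_nodes
    for node in sorted(node_spans):
        # Merge touching or overlapping intervals into contiguous regions.
        regions = []
        for left, right in sorted(node_spans[node]):
            if regions and left <= regions[-1][1]:
                regions[-1][1] = max(regions[-1][1], right)
            else:
                regions.append([left, right])
        for i, (left, right) in enumerate(regions):
            if i == 0:
                new_node = node  # leftmost region keeps the original ID
            else:
                new_node = next_id
                next_id += 1
            result[new_node] = {"span": (left, right), "unsplit_node_id": node}
    return result


print(split_disjoint({3: [(0, 10), (10, 20), (50, 60)], 4: [(0, 100)]}, num_nodes=5))
```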

Functions for Inferring Tree Sequences with Historical Samples#

tsdate.sites_time_from_ts(tree_sequence, *, unconstrained=True, node_selection='child', min_time=1)[source]#

Returns an estimated “time” for each site. This is the estimated age of the oldest MRCA which possesses a derived variant at that site, and is useful for performing (re)inference of a tree sequence. It is calculated from the ages of nodes, with the appropriate nodes identified by the position of mutations in the trees.

If node times in the tree sequence have been estimated by tsdate using the inside-outside algorithm, then as well as a time in the tree sequence, nodes will store additional time estimates that have not been explicitly constrained by the tree topology. By default, this function tries to use these “unconstrained” times, although this is likely to fail (with a warning) on tree sequences that have not been processed by tsdate: in this case the standard node times can be used by setting unconstrained=False.

The concept of a site time is meaningless for non-variable sites, and so the returned time for these sites is np.nan (note that this is not exactly the same as tskit.UNKNOWN_TIME, which marks sites that could have a meaningful time but whose time estimate is unknown).

Parameters:

tree_sequence (tskit.TreeSequence) – The input tree sequence.
unconstrained (bool) – Use estimated node times which have not been constrained by tree topology. If True (default), this requires a tree sequence which has been dated using the tsdate inside-outside algorithm. If this is not the case, specify False to use the standard tree sequence node times.
node_selection (str) –
Defines how site times are calculated from the age of the upper and lower nodes that bound each mutation at the site. Options are “child”, “parent”, “arithmetic” or “geometric”, with the following meanings
- 'child' (default): the site time is the age of the oldest node below each mutation at the site
- 'parent': the site time is the age of the oldest node above each mutation at the site
- 'arithmetic': the arithmetic mean of the ages of the node above and the node below each mutation is calculated; the site time is the oldest of these means.
- 'geometric': the geometric mean of the ages of the node above and the node below each mutation is calculated; the site time is the oldest of these means
min_time (float) – A site time of zero implies that no MRCA in the past possessed the derived variant, so the variant cannot be used for inferring relationships between the samples. To allow all variants to be potentially available for inference, if a site time would otherwise be calculated as zero (for example, where the node_selection parameter is “child” or “geometric” and all mutations at a site are associated with leaf nodes), a minimum site time greater than 0 is recommended. By default this is set to 1, which is generally reasonable for times measured in generations or years, although it is also fine to set this to a small epsilon value.
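The four node_selection options can be compared on a toy site. The function below is a hypothetical re-statement of the rules listed above, taking one (parent_age, child_age) pair per mutation; applying min_time as a floor on every site time is an assumption about how the minimum is enforced.

```python
import math


def site_time(mutations, node_selection="child", min_time=1):
    """Site time from (parent_age, child_age) pairs, one pair per mutation."""
    if not mutations:
        return math.nan  # non-variable site: no meaningful time
    combine = {
        "child": lambda parent, child: child,
        "parent": lambda parent, child: parent,
        "arithmetic": lambda parent, child: (parent + child) / 2,
        "geometric": lambda parent, child: math.sqrt(parent * child),
    }[node_selection]
    # The site time is the oldest value over all mutations at the site.
    oldest = max(combine(parent, child) for parent, child in mutations)
    return max(oldest, min_time)


toy_site = [(8.0, 2.0), (4.0, 0.0)]  # second mutation sits above a leaf
for option in ("child", "parent", "arithmetic", "geometric"):
    print(option, site_time(toy_site, option))
```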

Returns:

Array of length tree_sequence.num_sites with estimated time of each site

Return type:

numpy.ndarray(dtype=np.float64)

tsdate.add_sampledata_times(samples, sites_time)[source]#

Return a tsinfer.SampleData file with estimated times associated with sites. Ensures that each site’s time is at least as old as the oldest historic sample carrying a derived allele at that site.

Parameters: samples – A tsinfer SampleData object to add site times to. Any historic individuals in this SampleData file are used to constrain site times.
Returns: A copy of the input sample data with site times added
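The constraint applied to site times can be sketched with plain lists. The function name and the genotype-matrix input are hypothetical; the real function works on a tsinfer SampleData object.

```python
def constrain_site_times(sites_time, genotypes, sample_times):
    """Make each site at least as old as its oldest derived-allele carrier.

    genotypes[i][j] is nonzero if sample j carries a derived allele at site i.
    sample_times[j] is the age of sample j (0 for contemporary samples).
    """
    constrained = []
    for time, row in zip(sites_time, genotypes):
        carrier_times = [t for g, t in zip(row, sample_times) if g > 0]
        oldest_carrier = max(carrier_times, default=0)
        constrained.append(max(time, oldest_carrier))
    return constrained


# The third sample is a historic sample of age 20 carrying the first site.
print(constrain_site_times([5.0, 50.0], [[1, 0, 1], [0, 1, 0]], [0, 0, 20]))
```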

Constants#

tsdate.NODE_IS_HISTORICAL_SAMPLE = 1048576#: Node flag value indicating that this is a non-contemporary sample node

tsdate.NODE_SPLIT_BY_PREPROCESS = 1073741824#: Node flag value indicating that this was a disjoint node that was then split
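Both constants are single-bit flags (bits 20 and 30), so they can be tested with a bitwise AND. The sketch below redefines the values locally so that it runs without tsdate installed; the combined flags value and the helper name are hypothetical.

```python
NODE_IS_SAMPLE = 1                   # same value as tskit.NODE_IS_SAMPLE
NODE_IS_HISTORICAL_SAMPLE = 1 << 20  # 1048576
NODE_SPLIT_BY_PREPROCESS = 1 << 30   # 1073741824


def has_flag(flags, flag):
    """True if the given bit is set in a node's flags value."""
    return (flags & flag) != 0


# A made-up flags value for a node marked as a historical sample.
flags = NODE_IS_SAMPLE | NODE_IS_HISTORICAL_SAMPLE
print(has_flag(flags, NODE_IS_HISTORICAL_SAMPLE), has_flag(flags, NODE_SPLIT_BY_PREPROCESS))
```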
