# Methods
The methods available for tsdate inference can be divided into continuous-time and discrete-time approaches. Both approaches iteratively propagate information between nodes to construct an approximation of the marginal posterior distribution for the age of each node, given the mutational information in the tree sequence. Discrete-time approaches approximate the posterior across a grid of discrete timepoints (e.g. assign a probability to each node being at each timepoint). Continuous-time approaches approximate the posterior by a continuous univariate distribution (e.g. a gamma distribution).
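The difference between the two representations can be illustrated with a toy example (the names and values here are purely illustrative, not part of the tsdate API): the same posterior for a node's age can be stored either as normalised probabilities on a fixed grid of timepoints, or as just the two parameters of a gamma distribution.

```python
import math

def gamma_pdf(t, shape, rate):
    """Density of a Gamma(shape, rate) distribution at time t."""
    return rate**shape * t**(shape - 1) * math.exp(-rate * t) / math.gamma(shape)

# Continuous-time style: the whole posterior is just two numbers.
shape, rate = 3.0, 0.002          # mean age = shape / rate = 1500 generations

# Discrete-time style: probabilities assigned to a grid of timepoints.
timepoints = [100.0 * i for i in range(1, 101)]   # 100, 200, ..., 10000
weights = [gamma_pdf(t, shape, rate) for t in timepoints]
total = sum(weights)
probs = [w / total for w in weights]              # normalised grid posterior

grid_mean = sum(t * p for t, p in zip(timepoints, probs))
print(shape / rate, grid_mean)    # the two mean-age estimates are close
```

Note that the grid version needs one probability per timepoint per node, whereas the continuous version needs only two parameters per node, whatever the time range.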
In tests, we find that the continuous-time `variational_gamma` approach is the most accurate. The discrete-time `inside_outside` approach is slightly less accurate, especially for older times, but is slightly more numerically robust and also allows each node to have an arbitrary (discretised) probability distribution. The discrete-time `maximization` approach is always numerically stable but is the least accurate.
Changing the method is very simple:
```python
import tskit
import tsdate

input_ts = tskit.load("data/basic_example.trees")
ts = tsdate.date(input_ts, method="inside_outside", mutation_rate=1e-8, population_size=1000)
```
Alternatively, each method can be called directly as a separate function:

```python
ts = tsdate.inside_outside(input_ts, mutation_rate=1e-8, population_size=1000)
```
The available method names and functions are also listed in the `tsdate.core.estimation_methods` variable.
## Continuous-time
The only continuous-time algorithm currently implemented is the `variational_gamma` method.
- Pros
  - Time estimates are precise and are not affected by a choice of timepoints
  - No need to define a (possibly arbitrary) time grid, and no quadratic scaling with the number of timepoints
  - Old nodes do not suffer from time-discretisation issues caused by forcing bounds on the oldest times
  - Iterative updating properly accounts for cycles in the genealogy
  - No need to specify node-specific priors; a mixture "prior" (fit by expectation-maximization) is used to regularise the roots
  - Can account for variable population sizes using rescaling
- Cons
  - Assumes posterior times can be reasonably modelled by gamma distributions (e.g. that they are not bimodal)
  - The "expectation propagation" algorithm used to fit the posterior requires multiple rounds of iteration until convergence
  - Numerical stability issues are more common (but often indicate pathological or otherwise problematic tree sequences)
### The variational gamma method
The `variational_gamma` method approximates times by fitting a separate gamma distribution for each node, in a similar spirit to Schweiger and Durbin [2023]. The directed graph that represents the genealogy can (in its undirected form) contain cycles, so a technique called "expectation propagation" (EP) is used, in which local estimates of each gamma distribution are iteratively refined until they converge to a stable solution. This comes under a class of approaches sometimes known as "loopy belief propagation".
### Expectation propagation
We are in the process of writing a formal description of the algorithm, but in summary, it uses expectation propagation ("message passing") to update the gamma distribution for each node based on the ages of connected nodes and the mutations on connected edges. Updates take the form of moment matching against an iteratively refined approximation to the posterior, which makes this method very fast.
The iterative expectation propagation should converge to a fixed point that approximately minimizes the Kullback-Leibler divergence between the true posterior distribution and the approximation [Minka, 2001]. At the moment a relatively large number of iterations is used (testing indicates this is more than enough; it can also be changed via the `max_iterations` parameter); however, convergence is not formally checked. A stopping criterion will be implemented in future releases.
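The moment-matching step at the heart of each update can be sketched in a few lines (a minimal, illustrative sketch, not tsdate's actual implementation): given the mean and variance of the distribution being approximated, the gamma distribution with the same first two moments has shape mean²/variance and rate mean/variance.

```python
def moment_match_gamma(mean, variance):
    """Return the (shape, rate) of the gamma distribution with the given
    mean and variance, using mean = shape/rate and variance = shape/rate**2."""
    rate = mean / variance
    shape = mean * rate
    return shape, rate

# e.g. approximate a distribution with mean 1000 and variance 250_000:
shape, rate = moment_match_gamma(1000.0, 250_000.0)
print(shape, rate)  # prints 4.0 0.004; both moments are matched exactly
```

Because each update only requires computing two moments and solving for two parameters, a full pass over all the nodes is cheap, which is why the method is fast despite needing several iterations.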
Progress through the iterations can be tracked using the progress bar:

```python
ts = tsdate.date(input_ts, mutation_rate=1e-8, progress=True)
```
### Rescaling
During each EP step, the `variational_gamma` method implements a further process called rescaling, which can help to deal with the effects of variable population size through time. This is based on an algorithm introduced by the ARG inference software SINGER (Deng et al., 2024), which rescales node ages by matching observed and expected numbers of segregating sites within time windows.
Basically, time is broken up into a number of intervals, and times within intervals are simultaneously scaled such that the expected density of mutations along each path from a sample to the root best matches the mutational density predicted from the user-provided mutation rate. The number of intervals can be specified using the `rescaling_intervals` parameter. If set to 0, no rescaling is performed; this means that dates may be inaccurately estimated if the dataset comes from a set of samples with a complex demographic history.
`tsdate` uses a modified version of Deng et al.'s algorithm that works on gamma natural parameters rather than point estimates, and that is not biased by the artefactual polytomies introduced by `tsinfer` for the sake of compression.
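The core of the rescaling idea can be sketched as follows (purely illustrative; as noted above, tsdate's actual implementation works on gamma natural parameters and combines many paths). If a path through one time interval carries `observed` mutations over `span` base pairs and `duration` generations, then under mutation rate `mu` we expect `mu * span * duration` mutations, so the interval's duration is stretched or shrunk by the ratio of observed to expected counts.

```python
def rescale_intervals(durations, observed_mutations, span, mu):
    """Rescale each time interval so that the observed mutation count matches
    the expectation mu * span * duration (toy version of SINGER-style rescaling)."""
    rescaled = []
    for duration, observed in zip(durations, observed_mutations):
        expected = mu * span * duration
        factor = observed / expected if expected > 0 else 1.0
        rescaled.append(duration * factor)
    return rescaled

# One path through three time intervals over a 1 Mb region, mu = 1e-8:
durations = [500.0, 1500.0, 3000.0]   # generations spanned by each interval
observed = [10, 15, 30]               # mutations mapped to each interval
new_durations = rescale_intervals(durations, observed, span=1_000_000, mu=1e-8)
print(new_durations)  # first interval doubles (10 observed vs 5 expected);
                      # the other two already match their expectations
```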
TODO: describe the rescaling step in more detail. Could also link to the population size docs
## Discrete-time
The available discrete-time algorithms are the `inside_outside` and `maximization` methods. For historical reasons, these approaches use an informative (node-specific) prior, the conditional coalescent prior, which means that you either need to provide them with an estimated effective population size, or a priors object. Future improvements may allow empirical Bayes priors to be set in discrete-time methods, and coalescent priors to be set in continuous-time methods.
The `tsdate` discrete-time methods have the following advantages and disadvantages:

- Pros
  - Methods allow any shape for the distributions of times
  - Currently require just a single upward and downward pass through the edges
- Cons
  - The choice of grid timepoints is somewhat arbitrary (but reasonable defaults are picked based on the conditional coalescent)
  - Inferred times are imprecise due to discretization: a denser time grid can increase precision, but also increases computational cost (quadratic in the number of timepoints)
  - In particular, the oldest and youngest nodes can suffer from poor dating, as time into the past is unbounded, but a single oldest/youngest timepoint must be chosen
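The precision/cost trade-off can be illustrated with a toy grid (illustrative only; this is not the default `tsdate` grid): posterior mass can only sit on the chosen timepoints, so the grid spacing bounds the achievable precision.

```python
def nearest_timepoint(age, timepoints):
    """Return the grid timepoint closest to a true node age."""
    return min(timepoints, key=lambda t: abs(t - age))

# A coarse grid of 20 timepoints versus a denser grid of 200:
coarse = [i * 500.0 for i in range(20)]     # 0, 500, ..., 9500
dense = [i * 50.0 for i in range(200)]      # 0, 50, ..., 9950
true_age = 1234.0
print(nearest_timepoint(true_age, coarse))  # 1000.0: error of 234 generations
print(nearest_timepoint(true_age, dense))   # 1250.0: error of 16 generations
# The 10x denser grid cuts the worst-case error ~10x, but the inside and
# outside passes scale quadratically with the number of timepoints.
```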
### Inside Outside vs Maximization
The `inside_outside` approach has been shown to perform better empirically, but in theory the approach used does not properly account for cycles in the underlying genealogical network when updating posterior probabilities (a potential solution would be to implement a "loopy belief propagation" algorithm, as in the continuous-time `variational_gamma` method above). Occasionally the `inside_outside` method also has issues with numerical stability, although this is commonly indicative of pathological combinations of tree sequence topology and mutation patterns. Problems like this are often caused by long regions of the genome that have no mapped mutations (e.g. in the centromere), which can be removed by preprocessing.
The `maximization` approach is slightly less accurate empirically, and does not return true posteriors, but it is theoretically robust and, additionally, is always numerically stable.