Methods#

The methods available for tsdate inference can be divided into discrete-time and continuous-time approaches. Both approaches iteratively propagate information between nodes to construct an approximation of the marginal posterior distribution for the age of each node, given the mutational information in the tree sequence. Discrete-time approaches approximate the posterior across a grid of discrete timepoints (e.g. assign a probability to each node being at each timepoint). Continuous-time approaches approximate the posterior by a continuous univariate distribution (e.g. a gamma distribution).

In tests, we find that the continuous-time variational_gamma approach is the most accurate. The discrete-time inside_outside approach is slightly less accurate, especially for older times, but is more numerically robust, and the discrete-time maximization approach is always stable but is the least accurate.

Changing the method is very simple:

import tskit
import tsdate

input_ts = tskit.load("data/basic_example.trees")
ts = tsdate.date(input_ts, method="variational_gamma", mutation_rate=1e-8)

Alternatively each method can be called directly as a separate function:

ts = tsdate.variational_gamma(input_ts, mutation_rate=1e-8)

Currently the default is inside_outside, but this may change in future releases.

Discrete-time#

The available discrete-time algorithms are the inside_outside and maximization methods. They have the following advantages and disadvantages:

Pros

Methods allow any shape for the distributions of times

Currently require just a single upwards and downward pass through the edges

Cons

Choice of grid timepoints is somewhat arbitrary (but reasonable defaults are picked based on the conditional coalescent)

Inferred times are imprecise due to discretization: a denser timegrid can increase precision, but also increases computational cost (quadratic with number of timepoints)

In particular, the oldest/youngest nodes can suffer from poor dating, as time into the past is an unbounded value, but a single oldest/youngest timepoint must be chosen.

Inside Outside vs Maximization#

The inside_outside approach has been shown to perform better empirically, but in theory the appraoch used does not properly account for cycles in the underlying genealogical network when updating posterior probabilities (a potential solution would be to implement a “loopy belief propagation” algorithm as in the continuous-time variational_gamma method, below). Occasionally the inside_outside method also has issues with numerical stability, although this is commonly indicative of pathological combinations of tree sequence topology and mutation patterns. Problems like this are often caused by long regions of the genome that have no mapped mutations (e.g. in the centromere), which can be removed by preprocessing.

The maximization approach is slightly less accurate empirically, and will not return true posteriors, but is theoretically robust and additionally is always numerically stable.

Continuous-time#

The only continuous-time algorithm currently implemented is the variational_gamma method.

Pros

Time estimates are precise, and not affected by choice of timepoints.

No need to define (possibly arbitrary) timegrid, and no quadratic scaling with number of timepoints

Old nodes do not suffer from time-discretisation issues caused by forcing bounds on the oldest times

Iterative updating properly accounts for cycles in the genealogy

Cons

Assumes posterior times can be reasonably modelled by gamma distributions (e.g. they are not bimodal)

The “expectation propagation” algorithm used to fit the posterior requires multiple rounds of iteration until convergence.

Numerical stability issues are more common (but often indicate pathological of otherwise problematic tree sequences)

The variational gamma method#

The variational_gamma method approximates times by fitting separate gamma distributions for each node, in a similar spirit to Schweiger and Durbin [2023]. The directed graph that represents the genealogy can (in its undirected form) contain cycles, so a technique called “expectation propagation” is used, in which local estimates to each gamma distribution are iteratively refined until they converge to a stable solution. This comes under a class of approaches sometimes known as “loopy belief propagation”.

Note

As a result of testing, the default priors used for this method are identical for all nodes (i.e. a “global” prior is used), based on a composite of all the conditional coalescent priors for all nodes. See The conditional coalescent for details.

Expectation propagation#

We are in the process of writing a formal description of the algorithm, but in summary, this approach uses an expectation propagation (“message passing”) approach to update the gamma distribution for each node based on the ages of connected nodes and the mutations on connected edges. Updates take the form of moment matching against an iteratively refined approximation to the posterior, which makes this method very fast.

The iterative expectation propagation should converge to a fixed point that approximately minimizes Kullback-Leibler divergence between the true posterior distribution and the approximation [Minka, 2001]. At the moment, when method="variational_gamma", a relatively large number of iterations is used (which testing indicates is more than enough) but convergence is not formally checked. A stopping criterion will be implemented in future releases.

Progress through iterations can be output using the progress bar:

ts = tsdate.date(
    input_ts,
    method="variational_gamma",
    progress=True,
    mutation_rate=1e-8)