Introduction

The tsdate program [Wohns et al., 2022] infers dates for nodes in a genetic genealogy, sometimes loosely known as an ancestral recombination graph or ARG [Wong et al., 2023]. More precisely, it takes a genealogy in tree sequence format as an input and returns a copy of that tree sequence with altered node and mutation times. These times have been estimated on the basis of the number of mutations along the edges connecting genomes in the genealogy (i.e. using the “molecular clock”).
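The molecular clock idea can be illustrated with a back-of-the-envelope calculation. The sketch below is not how tsdate estimates times; it only shows the intuition that, at a constant mutation rate, the number of mutations on an edge is proportional to its duration and genomic span. The function name and the example numbers are hypothetical.

```python
def naive_branch_length(num_mutations, mutation_rate, span):
    """Naive molecular-clock point estimate of an edge's duration.

    Assumes a constant mutation rate per base pair per generation, so the
    expected number of mutations on an edge is
    mutation_rate * span * duration. Dividing the observed count by
    (mutation_rate * span) therefore gives a duration in generations.
    """
    if mutation_rate <= 0 or span <= 0:
        raise ValueError("mutation_rate and span must be positive")
    return num_mutations / (mutation_rate * span)


# Hypothetical edge: 3 mutations over 10,000 bp at 1e-8 mutations/bp/generation
estimate = naive_branch_length(3, 1e-8, 10_000)
print(f"Naive estimate: {estimate:.0f} generations")
```

A single edge rarely carries enough mutations for such a point estimate to be reliable, which is one reason to combine information across the whole genealogy, as described in the next section.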

Technical details

Methodologically, the genealogy is treated as an interconnected graph, and a Bayesian network approach is used to update the probability distribution of times for each node, given the time distributions on connected nodes and the mutations on connected edges. This results in a posterior distribution of times for each node (which can be output separately). The approach scales well to large genetic genealogies. The input tree sequence can come from any source, e.g. from simulation or from a variety of inference programs such as tsinfer.
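To make the idea of a Bayesian update concrete, here is a toy sketch for a single edge, under assumptions that are much simpler than what tsdate does: the child is a sample at time zero, the parent's age has a gamma prior, and mutations on the edge are Poisson distributed with mean `mutation_rate * span * age`. In that case the posterior is again a gamma distribution. The `Gamma` class and `update_with_mutations` function are hypothetical names for illustration and are not part of the tsdate API; the real algorithm passes information between all connected nodes in the genealogy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gamma:
    """Gamma distribution parameterised by shape and rate."""

    shape: float
    rate: float

    @property
    def mean(self):
        return self.shape / self.rate


def update_with_mutations(prior, num_mutations, mutation_rate, span):
    """Conjugate gamma-Poisson update for the age of a single parent node.

    Assumes the child is at time zero, so the edge duration equals the
    parent's age, and mutations ~ Poisson(mutation_rate * span * age).
    """
    return Gamma(
        shape=prior.shape + num_mutations,
        rate=prior.rate + mutation_rate * span,
    )


# Hypothetical prior: exponential with mean 10,000 generations (shape = 1)
prior = Gamma(shape=1.0, rate=1e-4)
# Hypothetical data: 3 mutations on an edge spanning 10,000 bp
posterior = update_with_mutations(prior, 3, 1e-8, 10_000)
print(f"Prior mean: {prior.mean:.0f}, posterior mean: {posterior.mean:.0f}")
```

The posterior mean lies between the prior mean and the naive estimate implied by the mutation count alone, which is the usual behaviour of a Bayesian update.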

Tsdate provides several methods for assigning probabilities to different times and for updating information through the genealogy. These include continuous-time (default) and discrete-time methods; see Methods for more details.

The output of tsdate is a new tree sequence with altered node and mutation times, as well as extra node and mutation Metadata. Optionally, a posterior distribution of node times can be generated (see Results and posteriors).

Since the method is Bayesian, technically it requires each node to have a prior distribution of times. The default variational_gamma method currently only imposes a prior on internal nodes via their topological connection to the root nodes (which are given an exponential prior). This Empirical Bayes procedure does not need any user input. However, the alternative discrete-time methods currently require the prior to be explicitly provided, either via providing an estimated effective population size (which is then used in the conditional coalescent), or directly.
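To give a feel for how an effective population size sets the timescale of a coalescent-based prior, the sketch below computes expected coalescence ages under the standard (unconditional) coalescent. This is not the conditional coalescent calculation that tsdate performs; it is a simplified illustration that assumes a diploid population of constant size, time measured in generations, and a hypothetical function name.

```python
def expected_coalescence_times(num_samples, population_size):
    """Expected ages of successive coalescence events, most recent first.

    Assumes the standard coalescent for a diploid population of constant
    effective size: while k lineages remain, the expected waiting time to
    the next coalescence is 4 * population_size / (k * (k - 1)) generations.
    """
    if num_samples < 2:
        raise ValueError("need at least two sample genomes")
    times = []
    age = 0.0
    for k in range(num_samples, 1, -1):
        age += 4 * population_size / (k * (k - 1))
        times.append(age)
    return times


# Hypothetical values: 10 sample genomes, effective population size 10,000
times = expected_coalescence_times(10, 10_000)
for event, t in enumerate(times, start=1):
    print(f"coalescence {event}: about {t:.0f} generations ago")
```

Doubling the population size doubles every expected age, which is why an estimate of effective population size is enough to place a prior on the timescale of the genealogy.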

The ultimate source of technical detail for tsdate is the source code on our GitHub repository.

Sources of genealogies

Input genealogies can come from any source, but tsdate is often coupled with tsinfer [Kelleher et al., 2019], which estimates the tree sequence topology but not the tree sequence node times.

An example of using tsinfer followed by tsdate on some DNA sequence data, illustrating that tsdate sets a timescale and changes node times so that mutations (red crosses) are more evenly distributed over edges of the genealogy. The modified genealogy also shows an increase in recent coalescences, as expected from theory.
[Figure: three panels labelled "DNA sequence", "tsinfer" and "tsdate". The first shows ten aligned sample sequences; the tsinfer and tsdate panels plot the genealogy against "Genome position", with time axes "Time ago (uncalibrated)" and "Time ago (generations)" respectively.]

Together, tsdate and tsinfer scale to analyses of millions of genomes, the largest genomic datasets currently available.