Introduction#

tsdate [Wohns et al., 2022] infers dates for nodes in a genetic genealogy, sometimes loosely known as an ancestral recombination graph or ARG [Wong et al., 2023]. More precisely, it takes a genealogy in tree sequence format as input and returns a copy of that tree sequence with altered node times. These times are estimated from the number of mutations along the edges connecting genomes in the genealogy (i.e. using the “molecular clock”).
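
To make the molecular clock idea concrete, here is a minimal sketch in plain Python. It is not tsdate's estimator: it only shows the relationship between branch length and expected mutation count that dating relies on. The function names, the mutation rate, and the edge span are assumptions chosen for illustration.

```python
def expected_mutations(branch_length, span, mutation_rate):
    """Expected mutation count on one edge under a constant molecular clock.

    branch_length: edge length in generations
    span: number of sites the edge covers
    mutation_rate: mutations per site per generation
    """
    return mutation_rate * span * branch_length


def clock_estimate(num_mutations, span, mutation_rate):
    """Naive point estimate of branch length (generations) from a mutation count."""
    return num_mutations / (mutation_rate * span)


# Illustrative values only (assumptions, not tsdate defaults).
mutation_rate = 1e-8
span = 1_000_000

# Five mutations on an edge covering one megabase.
estimate = clock_estimate(5, span, mutation_rate)
print(f"Estimated branch length: {estimate:.1f} generations")
```

A single edge gives a very noisy estimate, which is one reason to combine information across the whole genealogy rather than dating each edge in isolation.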

Technical details#

Methodologically, the genealogy is treated as an interconnected graph, and a Bayesian network approach is used to update the probability distribution of times for each node, given the time distribution on connected nodes and the mutations on connected edges. This results in a posterior distribution of times (which can be output separately). The approach scales well to large genetic genealogies. The input tree sequence can come from any source: e.g. from simulation or from a variety of inference programs, such as tsinfer.
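
The flavour of such an update can be sketched for a single edge on a discrete time grid. This is a toy under stated assumptions, not tsdate's algorithm: the child is a sample fixed at time 0, the prior is uniform over a handful of candidate times, and mutations on the edge are Poisson distributed. All names and numbers are hypothetical.

```python
import math


def poisson_pmf(k, lam):
    """Probability of observing k events when the expected count is lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)


def posterior_on_grid(time_grid, prior, num_mutations, span, mutation_rate):
    """Posterior over candidate parent times for one edge above a sample at time 0.

    Each candidate time t implies an expected mutation count of
    mutation_rate * span * t on the edge; the posterior is prior * likelihood,
    normalised to sum to one.
    """
    unnormalised = [
        p * poisson_pmf(num_mutations, mutation_rate * span * t)
        for t, p in zip(time_grid, prior)
    ]
    total = sum(unnormalised)
    return [u / total for u in unnormalised]


# Candidate node times in generations and a flat prior (both assumptions).
time_grid = [100, 200, 400, 800, 1600]
prior = [1 / len(time_grid)] * len(time_grid)

posterior = posterior_on_grid(
    time_grid, prior, num_mutations=5, span=1_000_000, mutation_rate=1e-8
)
most_probable_time = time_grid[posterior.index(max(posterior))]

for t, p in zip(time_grid, posterior):
    print(f"t = {t:5d} generations: posterior = {p:.3f}")
```

In a full genealogy the time of the child is itself uncertain, so the update for each node has to account for the distributions on its neighbours rather than a fixed time as in this sketch.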

As the approach is Bayesian, it requires a prior distribution to be defined for each of the nodes to date. By default, tsdate calculates priors from the conditional coalescent, although alternative prior distributions can also be specified.
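
As a rough illustration of the kind of information a coalescent-based prior carries, the sketch below computes the expected age of the root of a sample under the standard (unconditional) coalescent. It is not the conditional coalescent calculation that tsdate performs, which also depends on how many samples descend from each node; the function name and the population size are assumptions.

```python
def expected_tmrca(num_samples, diploid_population_size):
    """Expected time to the most recent common ancestor, in generations.

    Under the standard coalescent, while k lineages remain the waiting time
    to the next coalescence has mean 2 / (k * (k - 1)) in units of 2N
    generations, where N is the diploid population size.
    """
    if num_samples < 2:
        raise ValueError("need at least two samples")
    coalescent_units = sum(
        2 / (k * (k - 1)) for k in range(2, num_samples + 1)
    )
    return coalescent_units * 2 * diploid_population_size


# Illustrative population size (an assumption).
for n in (2, 10, 100):
    print(n, "samples:", round(expected_tmrca(n, 10_000)), "generations")
```

The expected root age grows only slowly with sample size, so a prior of this kind mainly sets the overall timescale, while the mutation data refine the individual node times.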

tsdate provides several methods for assigning probabilities to different times and for updating information through the genealogy. These include discrete-time and continuous-time methods; see Methods for more details.

The output of tsdate is a new tree sequence with altered node times, extra node Metadata, and (optionally) a posterior distribution of node times (see Posterior time distributions).

The ultimate source of technical detail for tsdate is the source code on our GitHub repository.

Sources of genealogies#

The genealogies to be dated can come from any source, but tsdate is often coupled with tsinfer [Kelleher et al., 2019], which estimates the topology of a tree sequence but not its node times.

An example of using `tsinfer` followed by `tsdate` on some DNA sequence data. tsdate sets a timescale and changes node times so that mutations (red crosses) are more evenly distributed over the edges of the genealogy. This results in more realistic local trees, with coalescences clustered at recent times, as expected from theory.
CCGTATTACCGAGGTCAGATGATCAGGCTATAAAC
AGGTAGTATCGAGGAAATATGATGAGGCTATCGAC
CCGTAGTTCCAAGGTAAGATGATGAGGCTATCAAC
CCGCCGTACAGACGAAATGAATTGGGCTTTACGAT
CCGTAGTACCGTGCTAAGATGAAGAGGCTATCATC
CCGTATTACCGAGGTAAGATGATCACGCTATAAAC
CCGCCGTACAGACGAAATGAATTGGGGTATACGAT
CCGTATTACCGAGGTAAGATGATCAGGCTATAAAC
CCACCGGACAGACGAACTGAATTGGGGTTTACGAT
CCGTAGTTCCAAGGTAAGATGATGAGGCTATCAAC
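[Figure: three panels labelled "DNA sequence", "tsinfer" and "tsdate". The tsinfer panel shows local trees along a genome position axis with an uncalibrated time axis (0 to 1); the tsdate panel shows the same trees with time in generations (0 to 3000).]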

Together, tsdate and tsinfer scale to analyses of millions of genomes, the largest genomic datasets currently available.