Introduction#
The tsdate program [Wohns et al., 2022] infers dates for nodes in a genetic genealogy, sometimes loosely known as an ancestral recombination graph or ARG [Wong et al., 2023]. More precisely, it takes a genealogy in tree sequence format as an input and returns a copy of that tree sequence with altered node and mutation times. These times have been estimated on the basis of the number of mutations along the edges connecting genomes in the genealogy (i.e. using the “molecular clock”).
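The molecular clock idea can be illustrated with a toy calculation. This is a minimal sketch, not tsdate's actual algorithm: the function name `clock_estimate` and the example numbers are hypothetical, and it assumes mutations arise as a Poisson process at a constant rate per base per generation.

```python
def clock_estimate(num_mutations, mutation_rate, span):
    """Point estimate of an edge's duration, in generations.

    Under a Poisson molecular clock, the expected number of mutations on an
    edge is mutation_rate * span * duration, so dividing the observed count
    by (mutation_rate * span) gives a simple estimate of the duration.
    """
    if mutation_rate <= 0 or span <= 0:
        raise ValueError("mutation_rate and span must be positive")
    return num_mutations / (mutation_rate * span)


# Hypothetical edge: 12 mutations over 1 Mb, at 1e-8 per base per generation
estimate = clock_estimate(num_mutations=12, mutation_rate=1e-8, span=1_000_000)
print(f"Estimated edge duration: {estimate:.0f} generations")
```

With these illustrative numbers the estimate is about 1200 generations. A single-edge estimate like this is noisy when mutation counts are small, which is one reason to combine information across the whole genealogy.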
Technical details#
Methodologically, the genealogy is treated as an interconnected graph, and a Bayesian network approach is used to update the probability distribution of times for each node, given the time distributions on connected nodes and the mutations on connected edges. This results in a posterior distribution of times for each node (which can be output separately). The approach scales well to large genetic genealogies. The input tree sequence can come from any source: e.g. from simulation or from a variety of inference programs, such as tsinfer.
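To give a feel for a single Bayesian update of this kind, the sketch below applies the textbook gamma-Poisson conjugate update to the duration of one edge. It is an illustration under stated assumptions, not the message-passing scheme tsdate implements: the `GammaDist` class, the `update_with_mutations` function and all numbers are hypothetical, and the real method updates node times jointly across the graph.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GammaDist:
    """Gamma distribution parameterised by shape and rate."""

    shape: float
    rate: float

    @property
    def mean(self):
        return self.shape / self.rate

    @property
    def variance(self):
        return self.shape / self.rate**2


def update_with_mutations(prior, num_mutations, mutation_rate, span):
    """Posterior for an edge duration after observing its mutation count.

    With a Gamma(shape, rate) prior on the duration t and a
    Poisson(mutation_rate * span * t) likelihood for the count, the
    posterior is Gamma(shape + count, rate + mutation_rate * span).
    """
    return GammaDist(
        shape=prior.shape + num_mutations,
        rate=prior.rate + mutation_rate * span,
    )


# Exponential prior (a gamma with shape 1) with a mean of 1000 generations
prior = GammaDist(shape=1.0, rate=1 / 1000)
posterior = update_with_mutations(
    prior, num_mutations=12, mutation_rate=1e-8, span=1_000_000
)
print(f"prior mean {prior.mean:.0f}, posterior mean {posterior.mean:.0f}")
print(f"posterior variance is smaller: {posterior.variance < prior.variance}")
```

The posterior mean lies between the prior mean and the raw clock estimate (mutation count divided by rate times span), and in this example the observed mutations reduce the variance relative to the prior.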
Tsdate provides several methods for assigning probabilities to different times and for updating information through the genealogy. These include continuous-time (default) and discrete-time methods; see Methods for more details.
The output of tsdate is a new tree sequence with altered node and mutation times, as well as extra node and mutation Metadata. Optionally, a posterior distribution of node times can be generated (see Results and posteriors).
Since the method is Bayesian, technically it requires each node to have a prior distribution of times. The default variational_gamma method currently only imposes a prior on internal nodes via their topological connection to the root nodes (which are given an exponential prior). This Empirical Bayes procedure does not need any user input. However, the alternative discrete-time methods currently require the prior to be explicitly provided, either by supplying an estimated effective population size (which is then used in the conditional coalescent) or directly.
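To show how an effective population size sets the timescale of a coalescent-based prior, the sketch below computes expected coalescence times under the standard (unconditional) Kingman coalescent. This is a simplification: the conditional coalescent prior used for dating additionally conditions on the number of sample descendants of each node, which is not modelled here. The function name is hypothetical, and a diploid population is assumed, so times scale with 2 × the population size.

```python
def expected_coalescent_times(num_samples, population_size):
    """Expected times (in generations) of successive coalescence events.

    Assumes the standard Kingman coalescent for a diploid population of
    constant effective size. While k lineages remain, the expected waiting
    time to the next coalescence is 4 * population_size / (k * (k - 1)).
    Returns the cumulative expected times, ending with the expected time
    to the most recent common ancestor of the whole sample.
    """
    if num_samples < 2:
        raise ValueError("need at least two sample genomes")
    times = []
    elapsed = 0.0
    for k in range(num_samples, 1, -1):
        elapsed += 4 * population_size / (k * (k - 1))
        times.append(elapsed)
    return times


# Hypothetical values: 10 sample genomes, effective population size 10,000
times = expected_coalescent_times(num_samples=10, population_size=10_000)
print(f"Expected time to the root: {times[-1]:.0f} generations")
```

Doubling the assumed population size doubles every expected time, which is why a prior built this way depends on a reasonable estimate of the effective population size.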
The ultimate source of technical detail for tsdate is the source code on our GitHub repository.
Sources of genealogies#
Input genealogies can come from any source, but tsdate is often coupled with tsinfer [Kelleher et al., 2019], which estimates the tree sequence topology but not the tree sequence node times.
[Figure: aligned DNA sequences (e.g. GGGGAGGGTTTACGATTTCGGCCGGTC) → tsinfer → tsdate]
Together, tsdate and tsinfer scale to analyses of millions of genomes, the largest genomic datasets currently available.