Introduction
tsdate
[Wohns et al., 2022] infers dates for nodes in a genetic genealogy,
sometimes loosely known as an ancestral recombination graph or ARG
[Wong et al., 2023]. More precisely, it takes a genealogy in
tree sequence format as an input
and returns a copy of that tree sequence with altered node times. These
times have been estimated on the basis of the number of mutations
along the edges connecting genomes in the genealogy (i.e. using the “molecular clock”).
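The molecular clock idea can be illustrated with a single edge. The sketch below is not tsdate's algorithm; it assumes a strict clock with a known per-base-pair, per-generation mutation rate, and the function name `clock_branch_length` and the example values are hypothetical. Under that assumption the expected number of mutations on an edge is the product of rate, genomic span, and duration, so the duration can be estimated by dividing the observed count by rate times span.

```python
def clock_branch_length(num_mutations, mutation_rate, span):
    """Point estimate of an edge's duration under a strict molecular clock.

    Mutations are assumed to accumulate at `mutation_rate` per base pair per
    generation along an edge covering `span` base pairs, so the expected
    count is mutation_rate * span * duration. Solving for the duration gives
    the estimate returned here.
    """
    if mutation_rate <= 0 or span <= 0:
        raise ValueError("mutation_rate and span must be positive")
    return num_mutations / (mutation_rate * span)


# Hypothetical edge: 5 mutations over 100 kb at 1e-8 mutations/bp/generation.
estimate = clock_branch_length(5, mutation_rate=1e-8, span=100_000)
print(f"Estimated edge duration: {estimate:.0f} generations")
```

A point estimate like this ignores uncertainty and the constraints imposed by neighbouring nodes, which is why a probabilistic treatment over the whole genealogy is used in practice.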
Technical details
Methodologically, the genealogy is treated as an interconnected graph, and a Bayesian network approach is used to update the probability distribution of times for each node, given the time distributions on connected nodes and the mutations on connected edges. The result is a posterior distribution of times for each node (which can be output separately). The approach scales well to large genetic genealogies. The input tree sequence can come from any source: e.g. from simulation or from a variety of inference programs, such as tsinfer.
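As a rough illustration of the kind of update involved, the sketch below computes a posterior over a small grid of candidate ages for one parent node connected by a single edge to a sample at time 0. It is a toy for one node, not tsdate's implementation: the function name `grid_posterior`, the flat prior, the time grid, and the Poisson mutation model with a fixed rate are all assumptions made for the example, and no information is passed between nodes.

```python
import math


def grid_posterior(prior, times, num_mutations, mutation_rate, span):
    """Posterior over candidate parent ages for one edge to a sample at time 0.

    The likelihood of `num_mutations` on the edge is taken to be Poisson with
    mean mutation_rate * span * t, where t is the candidate parent age.
    All candidate ages must be strictly positive.
    """
    unnormalised = []
    for p, t in zip(prior, times):
        mean = mutation_rate * span * t
        # Poisson log-probability of the observed mutation count.
        log_lik = (
            num_mutations * math.log(mean)
            - mean
            - math.lgamma(num_mutations + 1)
        )
        unnormalised.append(p * math.exp(log_lik))
    total = sum(unnormalised)
    # Normalise so that the posterior sums to one over the grid.
    return [w / total for w in unnormalised]


# Hypothetical inputs: five candidate ages (generations) and a flat prior.
times = [1000.0, 2000.0, 5000.0, 10000.0, 20000.0]
prior = [0.2] * len(times)
posterior = grid_posterior(
    prior, times, num_mutations=5, mutation_rate=1e-8, span=100_000
)
best = times[posterior.index(max(posterior))]
print(f"Most probable age on the grid: {best:.0f} generations")
```

In a full genealogy, each node's distribution would also depend on the distributions of its parents and children, which is what the network-based updating handles.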
As the approach is Bayesian, it requires a
prior distribution to be defined
for each of the nodes to date. By default, tsdate
calculates priors from the
conditional coalescent, although
alternative prior distributions can also be specified.
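For intuition about how coalescent theory yields expected node ages, the sketch below computes the expected time to the most recent common ancestor under the standard (unconditional) coalescent. This is a simplification: tsdate's conditional coalescent prior is a different, more detailed calculation that is not reproduced here, and the function name `expected_tmrca` is hypothetical. Time is measured in coalescent units, i.e. scaled so that each pair of lineages coalesces at rate 1.

```python
def expected_tmrca(num_samples):
    """Expected TMRCA of `num_samples` lineages under the standard coalescent.

    While k lineages remain, the expected waiting time to the next
    coalescence is 2 / (k * (k - 1)) in coalescent units; summing over
    k = 2..num_samples gives the expected time to the root.
    """
    if num_samples < 2:
        raise ValueError("need at least two samples")
    return sum(2 / (k * (k - 1)) for k in range(2, num_samples + 1))


for n in (2, 10, 100):
    print(n, expected_tmrca(n))
```

The sum telescopes to 2 * (1 - 1 / n), so the expected root age approaches 2 coalescent units as the sample size grows.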
tsdate
provides several methods for assigning probabilities to different times,
and updating information through the genealogy. These include discrete-time and
continuous-time methods; see Methods for more details.
The output of tsdate is a new tree sequence with altered
node times, extra node Metadata, and
(optionally) a posterior distribution of node times
(see Posterior time distributions).
The ultimate source of technical detail for tsdate
is the source code on our
GitHub repository.
Sources of genealogies
The input genealogies to date can come from any source,
but tsdate
is often coupled with tsinfer
[Kelleher et al., 2019], which estimates the tree sequence topology but
not the tree sequence node times.
[Figure: example aligned DNA sequences → tsinfer → tsdate]
Together, tsdate
and tsinfer
scale to analyses of millions of genomes, the largest genomic datasets currently available.