What is a tree sequence?

A succinct tree sequence, or tree sequence for short, represents the relationships between a set of DNA sequences. Tree sequences can be used to store genetic data efficiently, and enable powerful analysis of millions of whole genomes at a time. They can be created by simulation or by inferring relationships from genetic variation.

Tree sequences provide:
A record of full genetic ancestry
A tree sequence concisely captures the full history of a set of genomes by sharing common branches between adjacent genetic trees.
An encoding of DNA data
Placing mutations on a tree sequence allows lossless representation and compression of DNA datasets.
An efficient analysis framework
Many of the analyses we perform on DNA data can be expressed naturally as operations on the ancestral trees, leading to highly efficient algorithms.
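As a rough illustration of the first two points, the sketch below stores two adjacent trees as a single list of edges, each annotated with the genomic interval over which it applies, and then recovers sample genotypes from mutations placed on nodes. It is a toy in plain Python, not the tskit API: the names `Edge`, `parent_map`, `descends_from` and `genotypes`, and all of the data, are made up for this example.

```python
from typing import NamedTuple


class Edge(NamedTuple):
    """A parent-child link that holds over the half-open interval [left, right)."""
    left: float
    right: float
    parent: int
    child: int


# Hypothetical toy data: three sampled genomes (0, 1, 2), two ancestors (3, 4),
# and a genome of length 10 with one change of tree at position 5.
SAMPLES = [0, 1, 2]
EDGES = [
    Edge(0, 10, 3, 1),  # shared by both trees
    Edge(0, 5, 3, 0),   # left tree only
    Edge(5, 10, 3, 2),  # right tree only
    Edge(0, 10, 4, 3),  # shared by both trees
    Edge(0, 5, 4, 2),   # left tree only
    Edge(5, 10, 4, 0),  # right tree only
]
# Each mutation is (position, node): every sample that descends from the node
# in the tree at that position carries the derived allele.
MUTATIONS = [(2.0, 3), (7.0, 3), (8.0, 0)]


def parent_map(edges, position):
    """Return the tree at `position` as a child -> parent dictionary."""
    return {e.child: e.parent for e in edges if e.left <= position < e.right}


def descends_from(parents, sample, node):
    """True if `node` is `sample` itself or one of its ancestors in this tree."""
    u = sample
    while u != node and u in parents:
        u = parents[u]
    return u == node


def genotypes(edges, samples, mutations):
    """Decode one row of 0/1 alleles per mutation, one column per sample."""
    rows = []
    for position, node in mutations:
        parents = parent_map(edges, position)
        rows.append([int(descends_from(parents, s, node)) for s in samples])
    return rows


left_tree = parent_map(EDGES, 2.0)
right_tree = parent_map(EDGES, 7.0)
shared = set(left_tree.items()) & set(right_tree.items())
print(f"{len(EDGES)} edges stored, {len(shared)} shared between the two trees")
print(genotypes(EDGES, SAMPLES, MUTATIONS))
```

Storing the two trees separately would take eight parent-child links; here six edges suffice because two of them span both trees. The genotype rows are never stored; they are recomputed from the trees and the three mutations.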

See the tutorials for more details and extensive examples of how to generate and process tree sequences.


Wilder Wohns (2020), Phyloseminar: Tree sequence fundamentals
Yan Wong (2020), Phyloseminar: Tree sequences and inference
Jerome Kelleher (2020), PopGen Vienna: Inferring the ancestry of everyone
Wilder Wohns (2019), MIA Primer: Introduction to the tree sequence toolchain
Jerome Kelleher (2019), MIA: Succinct tree sequences for megasample genomics
Jerome Kelleher (2017), MIA: Simulating, storing & processing genetic variation data for millions of samples

Key publications

Inferring whole-genome histories in large population datasets, Nature Genetics (2019). Kelleher, Wong, Wohns, Fadil, Albers and McVean. doi:10.1038/s41588-019-0483-y
Start here if you're new to tree sequences. This paper introduces tsinfer, the method to infer tree sequence topologies from genetic variation data. Please see the preprint if you cannot access the Nature Genetics paper.
Efficiently Summarizing Relationships in Large Samples: A General Duality Between Statistics of Genealogies and Genomes, Genetics (2020). Ralph, Thornton and Kelleher. doi:10.1534/genetics.120.303253
This paper shows that we can think about any statistic that works on sequence data in an equivalent (and more powerful) way in terms of the underlying trees, and that we can compute these statistics very efficiently. Read this paper if you would like more technical details on how the underlying data structures work and an introduction to incremental tree sequence algorithms.
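To make the idea concrete, here is a small sketch that computes one statistic, the mean number of pairwise differences between samples, in two ways: by comparing genotypes directly, and by counting how many samples sit below each mutation in the local tree. It is a plain-Python toy under stated assumptions, not the incremental algorithm from the paper and not the tskit API; the data and the function names (`parent_map`, `carriers`, `diversity_from_genotypes`, `diversity_from_trees`) are hypothetical.

```python
from itertools import combinations

# Hypothetical toy data: three samples, two trees over a genome of length 10.
SAMPLES = [0, 1, 2]
# Each edge is (left, right, parent, child) and applies on [left, right).
EDGES = [
    (0, 10, 3, 1), (0, 5, 3, 0), (5, 10, 3, 2),
    (0, 10, 4, 3), (0, 5, 4, 2), (5, 10, 4, 0),
]
# Each mutation is (position, node).
MUTATIONS = [(2, 3), (7, 3), (8, 0)]


def parent_map(position):
    """Return the tree at `position` as a child -> parent dictionary."""
    return {c: p for left, right, p, c in EDGES if left <= position < right}


def carriers(position, node):
    """Samples that descend from `node` in the tree at `position`."""
    parents = parent_map(position)
    found = set()
    for sample in SAMPLES:
        u = sample
        while u != node and u in parents:
            u = parents[u]
        if u == node:
            found.add(sample)
    return found


def diversity_from_genotypes():
    """Sequence view: build every haplotype, then compare all pairs of samples."""
    haplotypes = {
        s: [int(s in carriers(pos, node)) for pos, node in MUTATIONS]
        for s in SAMPLES
    }
    pairs = list(combinations(SAMPLES, 2))
    differences = sum(
        sum(x != y for x, y in zip(haplotypes[a], haplotypes[b]))
        for a, b in pairs
    )
    return differences / len(pairs)


def diversity_from_trees():
    """Tree view: a mutation above k of n samples separates k * (n - k) pairs."""
    n = len(SAMPLES)
    num_pairs = n * (n - 1) // 2
    separated = 0
    for pos, node in MUTATIONS:
        k = len(carriers(pos, node))
        separated += k * (n - k)
    return separated / num_pairs


print(diversity_from_genotypes(), diversity_from_trees())
```

Both functions return the same value, but the tree-based version needs only one count per mutation rather than a comparison of every pair of samples at every site. This toy rebuilds each tree from scratch, so it shows only the equivalence between the two views, not the efficiency gains of updating counts incrementally as edges change from one tree to the next.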
Efficient pedigree recording for fast population genetics simulation, PLOS Computational Biology (2018). Kelleher, Thornton, Ashander and Ralph. doi:10.1371/journal.pcbi.1006581
Forwards-in-time simulations are very flexible but usually also very CPU-intensive. This paper shows how we used tree sequences to make forwards-in-time simulations both more efficient and even more flexible.
Tree‐sequence recording in SLiM opens new horizons for forward‐time simulation of whole genomes, Molecular Ecology Resources (2019). Haller, Galloway, Kelleher, Messer and Ralph. doi:10.1111/1755-0998.12968
Continuing from the 2018 PLOS Computational Biology paper, we discuss here how the tree sequence recording method was implemented in the powerful SLiM simulator. We show that some simulations become orders of magnitude more efficient, and give examples of the new possibilities that keeping a full record of the genetic ancestry makes available.
Efficient Coalescent Simulation and Genealogical Analysis for Large Sample Sizes, PLOS Computational Biology (2016). Kelleher, Etheridge and McVean. doi:10.1371/journal.pcbi.1004842
This is where it all started. Here we introduce the msprime coalescent simulator and the core algorithms and data structures that would later be separated out into tskit. Read this paper if you would like to find out more about coalescent simulation, or to understand the core tree sequence algorithms and theoretical results. Note: much of the terminology has been updated since this original publication as the models were generalised.