What is a tree sequence?
A succinct tree sequence, or tree sequence for short, represents the relationships between a set of DNA sequences. Tree sequences can be used to store genetic data efficiently, and enable powerful analysis of millions of whole genomes at a time. They can be created by simulation or by inferring relationships from genetic variation.
Tree sequences provide:
Browse tutorials, publications and videos:
A general and efficient representation of ancestral recombination graphs
06 September, 2024
GENETICS (2024) Wong et al
This paper recounts recent developments in competing representations of Ancestral Recombination Graphs (ARGs). It presents a simple formalism that defines an ARG in terms of specific genomes and their intervals of genetic inheritance, and shows how this generalizes to encompass the outputs of recent inference methods.
ARGs as tree sequences
10 November, 2023
The tskit library provides a convenient way to encode and work with ancestral recombination graphs (ARGs). This tutorial introduces the concept of “full ARGs” in tskit, the link between these ARGs and more simplified graphical representations, and how to use tskit to work with inheritance graphs in general.
The ARG revolution in population and statistical genetics
30 September, 2022
Wilder Wohns at SMBE Everywhere
In this video, Wilder Wohns gives examples of how ancestral recombination graphs (ARGs) stored in tskit format have the potential to revolutionise population and statistical genetics. In particular, inferred tree sequences can be used to infer geographical locations of human ancestry and to improve the speed and accuracy of genetic association analysis.
A unified genealogy of modern and ancient genomes
25 February, 2022
Science (2022) Wohns et al
This paper describes using tskit, tsinfer and tsdate to create a unified tree sequence of 3601 modern and 8 ancient human genome sequences compiled from eight datasets. It then introduces estimates of ancestor geographic location that recapitulate key features of human history.
Efficient ancestry and mutation simulation with msprime 1.0
13 December, 2021
Genetics (2021) Baumdicker et al
The accompanying paper to the msprime 1.0 release, summarising its features and performance and discussing its development model.
Tskit Terminology and Concepts
21 June, 2021
This tutorial serves as an introduction to the terminology and concepts in tskit, and its underlying data structures.
Getting started with tskit
21 June, 2021
You’ve run some simulations or inference methods, and you now have a TreeSequence object; what now? This tutorial is aimed at users who are new to tskit and would like to get some basic tasks completed.
Analysing Tree Sequences
21 June, 2021
This tutorial aims to give a quick overview of how the tskit statistics APIs work and how to use them effectively.
Tables and Editing
19 June, 2021
The underlying representation of a tree sequence in tskit is a set of tables. This tutorial shows how to access and manipulate these tables.
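As a rough illustration of the table-based representation, the sketch below models a node table and an edge table with plain Python data classes and recovers the local tree at a genomic position. The names `Node`, `Edge` and `parent_map_at` are hypothetical and this is not the tskit API; the only assumption is that each edge records a parent-child link over a half-open genomic interval.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    time: float       # how long ago the genome existed (0 = present day)
    is_sample: bool   # True for sampled genomes


@dataclass(frozen=True)
class Edge:
    left: float    # start of the inherited interval (inclusive)
    right: float   # end of the inherited interval (exclusive)
    parent: int    # row index of the parent in the node table
    child: int     # row index of the child in the node table


def parent_map_at(edges, position):
    """Return {child: parent} for the local tree covering `position`."""
    return {e.child: e.parent for e in edges if e.left <= position < e.right}


# Toy data (illustrative only): three samples (0, 1, 2) and two ancestors
# (3, 4) on a genome of length 10.
nodes = [
    Node(0.0, True), Node(0.0, True), Node(0.0, True),
    Node(1.0, False), Node(2.0, False),
]
edges = [
    Edge(0, 10, 3, 0),   # sample 0 descends from node 3 everywhere
    Edge(0, 5, 3, 1),    # sample 1 descends from node 3 on the left half
    Edge(5, 10, 4, 1),   # and directly from node 4 on the right half
    Edge(0, 5, 4, 2),    # sample 2 descends directly from node 4 on the left half
    Edge(5, 10, 3, 2),   # and from node 3 on the right half
    Edge(0, 10, 4, 3),   # node 3 descends from the root (node 4) everywhere
]

left_tree = parent_map_at(edges, 2.0)    # local tree on [0, 5)
right_tree = parent_map_at(edges, 7.0)   # local tree on [5, 10)
```

Because edges that span both halves of the genome are stored once, neighbouring trees share most of their rows, which is one way to picture why a table representation can be compact.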
Analysing Trees
19 June, 2021
tskit provides single tree traversals, algorithms and phylogenetic statistics, of which this tutorial gives an overview.
Working with Metadata
16 June, 2021
This tutorial gives an overview of tskit’s metadata system, which allows arbitrary, documented metadata to be attached to entities in tree sequences.
Do you really need mutations?
06 June, 2021
In tree sequences, the genetic genealogy exists independently of the mutations that generate genetic variation, and often we are primarily interested in genetic variation because of what it can tell us about those genealogies. This tutorial aims to illustrate when we can leave mutations and genetic variation aside and study the genealogies directly.
Visualization
29 May, 2021
It is often helpful to visualize a single tree, or multiple trees along a tree sequence, together with sites and mutations. tskit provides functions to do this, outputting either plain ASCII or Unicode text, or the more flexible Scalable Vector Graphics (SVG) format. This tutorial works through various examples.
Tskit and R
11 May, 2021
To interface with tskit in R, we can use the reticulate R package, which lets you call Python functions within an R session. In this short tutorial, we’ll go through a couple of examples to show you how to get started.
Completing forwards simulations
20 January, 2021
In this tutorial we show how to combine the best of both forwards and backwards simulation approaches by simulating the recent past using a forwards-time simulator and then completing the simulation of the ancient past using msprime.
msprime tutorials
19 January, 2021
A set of tutorials for msprime, covering demography, bottlenecks and introgression.
Efficiently Summarizing Relationships in Large Samples: A General Duality Between Statistics of Genealogies and Genomes
01 July, 2020
Genetics (2020) Ralph et al
doi: 10.1534/genetics.120.303253
This paper shows that we can think about any statistic that works on sequence data in an equivalent (and more powerful) way in terms of the underlying trees, and that we can compute these statistics very efficiently. Read this paper if you would like more technical details on how the underlying data structures work and an introduction to incremental tree sequence algorithms.
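The idea can be pictured on a toy example: the frequency of a mutation computed from genotypes equals the fraction of samples beneath the branch that carries it. The sketch below is plain Python with hypothetical names (`samples_below`, the `parent` dictionary and the toy data); it is not the tskit statistics API or the paper's algorithm, and it makes no attempt at the efficiency the paper describes.

```python
def samples_below(parent, node, samples):
    """Return the set of samples whose path to the root passes through `node`."""
    below = set()
    for s in samples:
        u = s
        while u is not None:
            if u == node:
                below.add(s)
                break
            u = parent.get(u)   # None once we step past the root
    return below


# Toy local tree as {child: parent}: node 3 is the parent of samples 0 and 1,
# and node 4 is the root above node 3 and sample 2.
parent = {0: 3, 1: 3, 2: 4, 3: 4}
samples = [0, 1, 2]

# Sequence view: observed genotypes at one site (1 = derived allele).
genotypes = [1, 1, 0]
freq_from_genotypes = sum(genotypes) / len(samples)

# Tree view: the same site is explained by one mutation on the branch above
# node 3, so its frequency is the fraction of samples below that branch.
mutation_node = 3
freq_from_tree = len(samples_below(parent, mutation_node, samples)) / len(samples)
```

Both views give a frequency of two out of three samples; the tree view never needs the genotype row itself.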
Tree sequences and inference
22 May, 2020
Yan Wong at Phyloseminar.org
A walk-through of the workings of the tsinfer algorithm, a rapid way to infer tskit tree sequences from existing genetic variation data.
Inferring whole-genome histories in large population datasets
02 September, 2019
Nature Genetics (2019) Kelleher et al
doi: 10.1038/s41588-019-0483-y
Start here if you’re new to tree sequences. This paper introduces tsinfer, the method to infer tree sequence topologies from genetic variation data. Please see the preprint if you cannot access the Nature Genetics paper.
Succinct tree sequences for megasample genomics (47:03)
26 April, 2019
Jerome Kelleher at MIA
Tree‐sequence recording in SLiM opens new horizons for forward‐time simulation of whole genomes
22 November, 2018
Molecular Ecology Resources (2019) Haller et al
Continuing on from the 2018 PLOS Computational Biology paper, we discuss here how the tree sequence recording method was implemented in the powerful SLiM simulator. We show how some simulations become orders of magnitude more efficient, and give examples of the new possibilities opened up by keeping a full record of the genetic ancestry.
Efficient pedigree recording for fast population genetics simulation
01 November, 2018
PLOS Computational Biology (2018) Kelleher et al
doi: 10.1371/journal.pcbi.1006581
Forwards-in-time simulations are very flexible but also usually very CPU intensive. This paper shows how we used tree sequences to make forwards-in-time simulations both more efficient and even more flexible.
Simulating, storing & processing genetic variation data for millions of samples
26 April, 2017
Jerome Kelleher at MIA
Efficient Coalescent Simulation and Genealogical Analysis for Large Sample Sizes
04 April, 2016
PLOS Computational Biology (2016) Kelleher et al
doi: 10.1371/journal.pcbi.1004842
This is where it all started. Here we introduce the msprime coalescent simulator and the core algorithms and data structures that would later be separated out into tskit. Read this paper if you would like to find out more about coalescent simulation, or to understand the core tree sequence algorithms and theoretical results. Note: much of the terminology has been updated since this original publication as the models were generalised.