The tskit ecosystem for your research
Tskit enables rapid and principled genomic analysis. The tree sequence file format provides an optimised, scalable way to store genetic variation data, and underlies modern whole-genome simulation software such as msprime. It achieves this by focussing on the evolutionary trees that underlie DNA sequence data, and by providing efficient methods to analyse them. Tskit therefore provides a common software framework that can be used across the fields of population genetics, statistical genetics, and phylogenetics.
Below we summarise the application and advantages of the tskit ecosystem in four key areas:
The tskit file format can store DNA sequence data from millions of individuals in a fraction of the space required by conventional matrix-based methods. Genetic variation data in tree sequence form can be generated using a number of simulation tools or inferred from existing data.
Once in tree sequence form, the tskit library can be used for rapid analysis: there are built-in functions that efficiently calculate many standard population genetic statistics. Because tree sequences capture the full genetic relationships between a set of genomes (and can encode the full ancestral recombination graph, or ARG), they also enable novel genealogy-based approaches to population genetic analysis.
Succinct tree sequences are designed to allow efficient statistical genetic calculation when scanning along entire genomes. This is done by removing the need for complete recalculation of statistics when moving between adjacent regions of the genome. In this way, processing of millions of multi-megabase genomes can take a few seconds or less.
Storing genetic variation as tree or graph structures allows further algorithmic efficiencies to be exploited. This leads to scalable approaches to a number of statistical genetic problems, such as matching genetic sequences against each other and accounting for correlations along the genome. Moreover, tree sequences encode the genetic genealogy or ancestry that underlies a set of genomes, which can be statistically useful in itself. For example, accounting for ancestry is often necessary in genome-wide association studies (GWAS).
Tskit can be used as a phylogenetic computing library for fast processing of very large evolutionary trees. Methods exist to query features of trees and traverse through their nodes, and efficient access is also possible via arrays that provide direct memory access to the underlying tree structure.
Arbitrary structured or unstructured metadata can be associated with nodes, edges, and other entities (e.g. mutations, populations) in a tree sequence. This provides a flexible way to associate data such as names, geographical locations, and phenotypes with sample genomes and their ancestors in the tree.
Tskit also provides generalisable methods to calculate statistics based on tree branch lengths, as well as topological methods that generate, classify, and compare tree shapes. Optimised methods are provided that map genetic variation parsimoniously onto a tree, and other efficient phylogenetic algorithms can be developed in Python using numba to achieve near-native speeds.
Unlike the data structures used by other phylogenetic libraries, tree sequences are explicitly designed for storage and processing of large numbers of related trees. This makes it easy, for example, to allow for genetic recombination within an evolutionary genealogy.
Use tskit in your application
Tskit is a lightweight library that provides both a Python and a C API for reading, writing, and manipulating succinct tree sequences. Simulation and inference tools can use these APIs to write tree sequences to disk, and analysis tools can use them to read tree sequences, decode their nucleotide sequences, perform efficient statistical calculations, or rapidly develop new algorithms. Examples of using the tskit C API can be found here, while use of the Python API is extensively described in the tutorials below: