Quickstart

Our tutorials site has a more extensive tutorial, Getting started with tskit. This page gives only a quick flavour of the Python API; APIs also exist in C and Rust, and the Python library can be called from R.

Basic properties

Any tree sequence, such as one generated by msprime, can be loaded and summarised in a table. This example uses a small tree sequence, but the tskit library scales effectively to tree sequences encoding millions of genomes and variable sites.

import tskit

ts = tskit.load("data/basic_tree_seq.trees")  # Or generate using e.g. msprime.sim_ancestry()
ts  # In a Jupyter notebook this displays a summary table. Otherwise use print(ts)
Tree Sequence
Trees            4
Sequence Length  10000.0
Time Units       generations
Sample Nodes     6
Total Size       3.5 KiB
Metadata         No Metadata

Table        Rows  Size       Has Metadata
Edges        20    648 Bytes
Individuals  3     108 Bytes
Migrations   0     8 Bytes
Mutations    5     201 Bytes
Nodes        14    400 Bytes
Populations  1     224 Bytes
Provenances  2     1.7 KiB
Sites        5     141 Bytes

Individual trees

You can extract an individual tree, for example the first tree in the tree sequence, and analyse it.

first_tree = ts.first()
print("Total branch length in first tree is", first_tree.total_branch_length, ts.time_units)
print("The first of", ts.num_trees, "trees is plotted below")
first_tree.draw_svg(y_axis=True)  # plot the tree: only useful for small trees
Total branch length in first tree is 4496.0 generations
The first of 4 trees is plotted below
[Figure: SVG plot of the first tree (_images/quickstart_5_1.svg)]

Extracting genetic data

A tree sequence provides an extremely compact way to store genetic variation data. The trees allow this data to be decoded at each site:

for variant in ts.variants():
    print(
        "Variable site", variant.site.id,
        "at genome position", variant.site.position,
        ":", [variant.alleles[g] for g in variant.genotypes],
    )
Variable site 0 at genome position 536.0 : ['A', 'A', 'A', 'A', 'G', 'A']
Variable site 1 at genome position 2447.0 : ['C', 'G', 'G', 'G', 'G', 'G']
Variable site 2 at genome position 6947.0 : ['G', 'C', 'C', 'C', 'C', 'C']
Variable site 3 at genome position 7868.0 : ['C', 'C', 'C', 'C', 'C', 'T']
Variable site 4 at genome position 8268.0 : ['C', 'C', 'C', 'C', 'T', 'C']

Analysis

Tree sequences enable efficient analysis of genetic variation using a comprehensive range of built-in Statistics:

genetic_diversity = ts.diversity()
print("Av. genetic diversity across the genome is", genetic_diversity)

branch_diversity = ts.diversity(mode="branch")
print("Av. genealogical dist. between pairs of tips is", branch_diversity, ts.time_units)
Av. genetic diversity across the genome is 0.00016666666666666666
Av. genealogical dist. between pairs of tips is 1645.8752266666668 generations

Plotting the whole tree sequence

Plotting all the trees along the genome can give you a visual feel for small genealogies:

ts.draw_svg(
    size=(800, 300),
    y_axis=True,
    mutation_labels={m.id: m.derived_state for m in ts.mutations()},
)
[Figure: SVG plot of the whole tree sequence (_images/quickstart_11_0.svg)]

Underlying data structures

The data that defines a tree sequence is stored in a set of tables. These tables can be viewed, and copies of the tables can be edited to create a new tree sequence.

# The sites table is one of several tables that underlie a tree sequence
ts.tables.sites
id  position  ancestral_state  metadata
0   536       A
1   2447      G
2   6947      C
3   7868      C
4   8268      C

The rest of this documentation gives a comprehensive description of the entire tskit library, including descriptions and definitions of all the tables.