Quickstart

Our tutorials site has a more extensive tutorial on Getting started with tskit. Below we give just a quick flavour of the Python API (APIs also exist in C and Rust, and the Python library can be called from R).

Basic properties

Any tree sequence, such as one generated by msprime, can be loaded and a summary table printed. This example uses a small tree sequence, but the tskit library scales effectively to tree sequences encoding millions of genomes and variable sites.

import tskit

ts = tskit.load("data/basic_tree_seq.trees")  # Or generate using e.g. msprime.sim_ancestry()
ts  # In a Jupyter notebook this displays a summary table. Otherwise use print(ts)
Tree Sequence
Trees            4
Sequence Length  10,000.0
Time Units       generations
Sample Nodes     6
Total Size       3.5 KiB
Metadata         No Metadata

Table        Rows  Size       Has Metadata
Edges        20    648 Bytes
Individuals  3     108 Bytes
Migrations   0     8 Bytes
Mutations    5     201 Bytes
Nodes        14    400 Bytes
Populations  1     224 Bytes
Provenances  2     1.7 KiB
Sites        5     141 Bytes
Provenance Timestamp          Software Name  Version  Command
14 July, 2022 at 09:51:11 PM  msprime        1.2.0    sim_mutations
14 July, 2022 at 09:51:11 PM  msprime        1.2.0    sim_ancestry

Full record (sim_mutations):
schema_version: 1.0.0
software:
  name: msprime
  version: 1.2.0
parameters:
  command: sim_mutations
  tree_sequence:
    __constant__: __current_ts__
  rate: 2e-07
  model: None
  start_time: None
  end_time: None
  discrete_genome: None
  keep: None
  random_seed: 123
environment:
  os:
    system: Darwin
    node: Yans-New-Air
    release: 20.6.0
    version: Darwin Kernel Version 20.6.0: Tue Feb 22 21:10:41 PST 2022; root:xnu-7195.141.26~1/RELEASE_X86_64
    machine: x86_64
  python:
    implementation: CPython
    version: 3.9.10
  libraries:
    kastore:
      version: 2.1.1
    tskit:
      version: 0.5.1
    gsl:
      version: 2.7

Full record (sim_ancestry):
schema_version: 1.0.0
software:
  name: msprime
  version: 1.2.0
parameters:
  command: sim_ancestry
  samples: 3
  demography: None
  sequence_length: 10000.0
  discrete_genome: None
  recombination_rate: 1e-07
  gene_conversion_rate: None
  gene_conversion_tract_length: None
  population_size: 1000
  ploidy: None
  model: dtwf
  initial_state: None
  start_time: None
  end_time: None
  record_migrations: None
  record_full_arg: None
  num_labels: None
  random_seed: 665
  replicate_index: 0
environment:
  os:
    system: Darwin
    node: Yans-New-Air
    release: 20.6.0
    version: Darwin Kernel Version 20.6.0: Tue Feb 22 21:10:41 PST 2022; root:xnu-7195.141.26~1/RELEASE_X86_64
    machine: x86_64
  python:
    implementation: CPython
    version: 3.9.10
  libraries:
    kastore:
      version: 2.1.1
    tskit:
      version: 0.5.1
    gsl:
      version: 2.7



Individual trees

You can extract an individual tree, for example the first tree in the tree sequence, and analyse it.

first_tree = ts.first()
print("Total branch length in first tree is", first_tree.total_branch_length, ts.time_units)
print("The first of", ts.num_trees, "trees is plotted below")
first_tree.draw_svg(y_axis=True)  # plot the tree: only useful for small trees
Total branch length in first tree is 4496.0 generations
The first of 4 trees is plotted below
[Figure: SVG drawing of the first tree, with a time axis]

Extracting genetic data

A tree sequence provides an extremely compact way to store genetic variation data. The trees allow this data to be decoded at each site:

for variant in ts.variants():
    print(
        "Variable site", variant.site.id,
        "at genome position", variant.site.position,
        ":", [variant.alleles[g] for g in variant.genotypes],
    )
Variable site 0 at genome position 536.0 : ['A', 'A', 'A', 'A', 'G', 'A']
Variable site 1 at genome position 2447.0 : ['C', 'G', 'G', 'G', 'G', 'G']
Variable site 2 at genome position 6947.0 : ['G', 'C', 'C', 'C', 'C', 'C']
Variable site 3 at genome position 7868.0 : ['C', 'C', 'C', 'C', 'C', 'T']
Variable site 4 at genome position 8268.0 : ['C', 'C', 'C', 'C', 'T', 'C']

Analysis

Tree sequences enable efficient analysis of genetic variation using a comprehensive range of built-in Statistics:

genetic_diversity = ts.diversity()
print("Av. genetic diversity across the genome is", genetic_diversity)

branch_diversity = ts.diversity(mode="branch")
print("Av. genealogical dist. between pairs of tips is", branch_diversity, ts.time_units)
Av. genetic diversity across the genome is 0.00016666666666666666
Av. genealogical dist. between pairs of tips is 1645.8752266666668 generations

Plotting the whole tree sequence

This can give you a visual feel for small genealogies:

ts.draw_svg(
    size=(800, 300),
    y_axis=True,
    mutation_labels={m.id: m.derived_state for m in ts.mutations()},
)
[Figure: SVG drawing of the whole tree sequence, with a time axis and mutations labelled by their derived state]

Underlying data structures

The data that defines a tree sequence is stored in a set of tables. These tables can be viewed, and copies of the tables can be edited to create a new tree sequence.

# The sites table is one of several tables that underlie a tree sequence
ts.tables.sites
id  position  ancestral_state  metadata
0   536       A
1   2,447     G
2   6,947     C
3   7,868     C
4   8,268     C

The rest of this documentation gives a comprehensive description of the entire tskit library, including descriptions and definitions of all the tables.