Quickstart
Our tutorials site has a more extensive tutorial on Getting started with tskit. Below we give just a quick flavour of the Python API. APIs also exist in C and Rust, and the Python library can be called from R.
Basic properties
Any tree sequence, such as one generated by msprime, can be loaded, and a summary table printed. This example uses a small tree sequence, but the tskit library scales effectively to ones encoding millions of genomes and variable sites.
import tskit
ts = tskit.load("data/basic_tree_seq.trees") # Or generate using e.g. msprime.sim_ancestry()
ts # In a Jupyter notebook this displays a summary table. Otherwise use print(ts)
Property | Value |
---|---|
Trees | 4 |
Sequence Length | 10000.0 |
Time Units | generations |
Sample Nodes | 6 |
Total Size | 3.5 KiB |
Metadata | No Metadata |

Table | Rows | Size | Has Metadata |
---|---|---|---|
Edges | 20 | 648 Bytes | |
Individuals | 3 | 108 Bytes | |
Migrations | 0 | 8 Bytes | |
Mutations | 5 | 201 Bytes | |
Nodes | 14 | 400 Bytes | |
Populations | 1 | 224 Bytes | ✅ |
Provenances | 2 | 1.7 KiB | |
Sites | 5 | 141 Bytes | |
Individual trees
You can extract an individual tree, for example the first tree in the tree sequence, and analyse it.
first_tree = ts.first()
print("Total branch length in first tree is", first_tree.total_branch_length, ts.time_units)
print("The first of", ts.num_trees, "trees is plotted below")
first_tree.draw_svg(y_axis=True) # plot the tree: only useful for small trees
Total branch length in first tree is 4496.0 generations
The first of 4 trees is plotted below
Extracting genetic data
A tree sequence provides an extremely compact way to store genetic variation data. The trees allow this data to be decoded at each site:
for variant in ts.variants():
    print(
        "Variable site", variant.site.id,
        "at genome position", variant.site.position,
        ":", [variant.alleles[g] for g in variant.genotypes],
    )
Variable site 0 at genome position 536.0 : ['A', 'A', 'A', 'A', 'G', 'A']
Variable site 1 at genome position 2447.0 : ['C', 'G', 'G', 'G', 'G', 'G']
Variable site 2 at genome position 6947.0 : ['G', 'C', 'C', 'C', 'C', 'C']
Variable site 3 at genome position 7868.0 : ['C', 'C', 'C', 'C', 'C', 'T']
Variable site 4 at genome position 8268.0 : ['C', 'C', 'C', 'C', 'T', 'C']
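The list comprehension above works because each variant stores its genotypes as integer indexes into its `alleles` tuple, one index per sample node. A minimal sketch of that lookup in plain Python (no tskit needed) follows. The `alleles` and `genotypes` values are assumptions reconstructed from the printed output for site 0 and its ancestral state `A`; they are not read from the file.

```python
# Hypothetical values reconstructed from the printed output for site 0.
# Index 0 of the alleles tuple is the ancestral state.
alleles = ("A", "G")
genotypes = [0, 0, 0, 0, 1, 0]  # one allele index per sample node

# Decode each integer index into its allele string.
decoded = [alleles[g] for g in genotypes]
print(decoded)  # ['A', 'A', 'A', 'A', 'G', 'A']
```

Storing small integers rather than strings keeps the genotype array compact. The allele strings are only looked up when they are needed.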
Analysis
Tree sequences enable efficient analysis of genetic variation using a comprehensive range of built-in Statistics:
genetic_diversity = ts.diversity()
print("Av. genetic diversity across the genome is", genetic_diversity)
branch_diversity = ts.diversity(mode="branch")
print("Av. genealogical dist. between pairs of tips is", branch_diversity, ts.time_units)
Av. genetic diversity across the genome is 0.00016666666666666666
Av. genealogical dist. between pairs of tips is 1645.8752266666668 generations
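As a rough cross-check of the first number, site-based diversity can be recomputed by hand from the decoded genotypes printed earlier. The sketch below assumes that the default statistic is the mean number of pairwise differences between samples, summed over sites and divided by the sequence length. The helper name `mean_pairwise_differences` is hypothetical and is not part of tskit.

```python
from itertools import combinations

# Decoded genotypes copied from the "Extracting genetic data" output,
# one list per variable site, one allele per sample node.
genotypes_by_site = [
    ["A", "A", "A", "A", "G", "A"],
    ["C", "G", "G", "G", "G", "G"],
    ["G", "C", "C", "C", "C", "C"],
    ["C", "C", "C", "C", "C", "T"],
    ["C", "C", "C", "C", "T", "C"],
]
sequence_length = 10000.0  # from the summary table


def mean_pairwise_differences(site_alleles):
    """Fraction of sample pairs that carry different alleles at one site."""
    pairs = list(combinations(site_alleles, 2))
    return sum(a != b for a, b in pairs) / len(pairs)


# Sum over sites, then normalise by the length of the genome.
site_diversity = (
    sum(mean_pairwise_differences(site) for site in genotypes_by_site)
    / sequence_length
)
print("Hand-computed site diversity:", site_diversity)
```

Each site here is a singleton among 6 samples, so 5 of the 15 sample pairs differ at every site. This gives 5 × (1/3) / 10000, which agrees with the value reported by `ts.diversity()` up to floating-point rounding.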
Plotting the whole tree sequence
This can give you a visual feel for small genealogies:
ts.draw_svg(
    size=(800, 300),
    y_axis=True,
    mutation_labels={m.id: m.derived_state for m in ts.mutations()},
)
Underlying data structures
The data that defines a tree sequence is stored in a set of tables. These tables can be viewed, and copies of the tables can be edited to create a new tree sequence.
# The sites table is one of several tables that underlie a tree sequence
ts.tables.sites
id | position | ancestral_state | metadata |
---|---|---|---|
0 | 536 | A | |
1 | 2447 | G | |
2 | 6947 | C | |
3 | 7868 | C | |
4 | 8268 | C |
The rest of this documentation gives a comprehensive description of the entire tskit library, including descriptions and definitions of all the tables.