# What is a tree sequence?¶

A succinct tree sequence, or “tree sequence” for short, represents the relationships between a set of DNA sequences. Tree sequences are based on fundamental biological principles of inheritance, DNA duplication, and recombination; they can be created by simulation or by inferring relationships from empirical DNA data.

Tree sequences provide an efficient way of storing genetic data, and can power analyses of millions of whole genomes. Plots (a) and (b) summarize results presented further down in this tutorial.

from IPython.display import SVG
import matplotlib_inline
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
matplotlib_inline.backend_inline.set_matplotlib_formats('svg')

data1 = np.genfromtxt("data/storing_everyone.csv", delimiter=",", usecols=np.arange(1,12), names=True)
data2 = np.genfromtxt("data/benchmarks_without_copy_longer_genome.txt", encoding=None, names=True, dtype=None)
fig, (ax1, ax2) = plt.subplots(1,2, figsize=(16, 4.5))
keep = data1['sample_size'] <= 1e6
x, y = data1['sample_size'][keep], data1['tsk_fit'][keep]/data1['vcf_fit'][keep]
ax1.spines["top"].set_visible(False)
ax1.spines["right"].set_visible(False)
ax1.loglog(x, y, c="C0", linewidth=4)
ax1.set_xlabel('# of 100Mb genomes', fontsize=18)
ax1.set_ylabel('Size of tree sequence\nfile (relative to VCF) ', fontsize=18)
ax1.tick_params(axis="both", labelsize=16)

txt = ax1.text(0.5, 1.3, "(a) Storing a million genomes as a tree sequence takes thousands of times less disk space",
ha='center', va='top', transform=ax1.transAxes, wrap=True, size=24)
txt._get_wrap_line_width = lambda: 600

ts_time = {n: t for s, n, t in data2[['toolkit','nsam','seconds']] if s == 'tskit'}
libseq_time = {n: t for s, n, t in data2[['toolkit','nsam','seconds']] if s == 'libseq'}
x = np.unique(list(ts_time.keys()) + list(libseq_time.keys()))
y = np.array([libseq_time[time]/ts_time[time] for time in x])
ax2.spines["top"].set_visible(False)
ax2.spines["right"].set_visible(False)
ax2.loglog(x, y, linewidth=4)
ax2.set_xlabel("# of genomes", fontsize=18)
ax2.set_ylabel("Tajima's D calculations per\nunit time (relative to libseq)", fontsize=18)
ax2.tick_params(axis="both", labelsize=16)
txt = ax2.text(0.5, 1.3, "(b) Genetic calculations on millions of genomes can be sped up by many orders of magnitude",
ha='center', va='top', transform=ax2.transAxes, wrap=True, size=24
)
txt._get_wrap_line_width = lambda: 600
plt.show()


As the name suggests, the simplest way to think about a tree sequence is as a sequence of “local trees” — i.e. trees located at different points along the chromosome. Here’s a tiny example based on ten genomes, $$\mathrm{a}$$ to $$\mathrm{j}$$, spanning a short 1kb chromosome.

import string
import msprime
from IPython.display import SVG

seed = 3096  # chosen to create a nice example simulation
ts = msprime.sim_ancestry(5, population_size=1e4, sequence_length=1000,
recombination_rate=1e-8, random_seed=seed)
# Extra code to label and order the tips alphabetically rather than numerically
labels = {i: string.ascii_lowercase[i] for i in range(ts.num_nodes)}
genome_order = [n for n in ts.first().nodes(order="minlex_postorder") if ts.node(n).is_sample()]
labels.update({n: labels[i] for i, n in enumerate(genome_order)})
style1 = (
".node:not(.sample) > .sym, .node:not(.sample) > .lab {visibility: hidden;}"
".mut {font-size: 12px}")
sz = (800, 250)  # size of the plot, slightly larger than the default

SVG(ts.draw_svg(size=sz, node_labels=labels, style=style1))


The tickmarks on the X axis and background shading indicates the genomic positions covered by the trees. For almost three quarters of the chromosome, from the start until position 715, the relationships between the ten genomes are shown by the first tree. The second tree shows the relationships between positions 715 and 932, and the third from position 932 to the end. We can say that the first tree spans 715bp, the second 217bp, and the third 68bp.

Multiple trees are needed because of genetic recombination, which causes different regions of the chromosome to have different histories. Together, the sequence of trees describe the full genetic ancestry, or genetic genealogy, of our 10 genomes.

## An efficient encoding of DNA data¶

A tree sequence can be used to describe patterns of genetic variation by combining the trees with a knowledge of where mutations occur on their branches. Here’s how that might look in our simple example:

seed = 244  # a simpler example uses 214
mutated_ts = msprime.sim_mutations(ts, rate=5e-8, random_seed=seed)

mut_labels = {}  # An array of labels for the mutations, listing position & allele change
l = "{:g} ({}→{})"
for mut in mutated_ts.mutations():  # This entire loop is just to make pretty labels
site = mutated_ts.site(mut.site)
older_mut = mut.parent >= 0  # is there an older mutation at the same position?
prev = mutated_ts.mutation(mut.parent).derived_state if older_mut else site.ancestral_state
mut_labels[mut.id] = l.format(site.position, prev, mut.derived_state)

SVG(mutated_ts.draw_svg(
size=sz, style=style1,
node_labels=labels, mutation_labels=mut_labels))


There are now ten single nucleotide mutations in the tree sequence. They are shown on the branches of the trees, and the positions of the ten variable sites associated with the mutations are shown along the X axis.

The trees inform us that, for example, the final mutation (at position 986) is inherited by genomes $$\mathrm{a}$$ to $$\mathrm{i}$$. These genomes must have a G at that position, compared to the original value of C. In other words, once we know the ancestry, placing a relatively small number of mutations is enough to explain all the observed genetic variation. Here’s the resulting “variant matrix”:

haplotypes = ["   ".join(h) for h in mutated_ts.haplotypes()]
print("Position  " + " ".join(str(int(s.position)) for s in mutated_ts.sites()))
print("\n".join(sorted([f"Genome {labels[i]}:  {h}" for i, h in enumerate(haplotypes)])))

Position  200 417 428 467 588 667 800 869 875 986
Genome a:  G   A   T   A   G   C   A   A   T   G
Genome b:  G   A   T   A   G   C   A   A   T   G
Genome c:  G   T   T   A   G   C   A   A   T   G
Genome d:  T   T   G   T   G   C   A   A   T   G
Genome e:  T   T   G   T   G   T   A   A   T   G
Genome f:  T   T   G   T   G   T   A   A   T   G
Genome g:  T   T   G   T   G   C   A   A   T   G
Genome h:  T   T   G   T   G   C   A   A   T   G
Genome i:  T   T   G   T   G   C   C   T   C   G
Genome j:  T   T   G   T   A   C   C   T   C   C


This approach scales effectively to millions of genomes, and to chromosomes of hundreds of megabases in length. The ability to deal with huge datasets comes down to one key feature of genomic data: adjacent trees along a chromosome are highly correlated, that is, they share structure. In our example this becomes evident if we highlight the branches (“edges” in tree sequence terminology) that remain unchanged between the first and the second tree.

# Highlight certain edges in certain trees. Other visualization possibilities in tutorials/viz.html
kept_edges = [e for e in ts.edges() if e.left==0 and e.right>ts.breakpoints(True)[1]]
style3 = (
",".join(f"#svg1 .tree:not(.t2) .node.a{e.parent}.n{e.child} > .edge" for e in kept_edges)
+ "{stroke:#00DD00; stroke-width: 2px}"
+ style1)
sz = (500, 250)
SVG(ts.draw_svg(
size=sz, x_lim=(0, 900), root_svg_attributes={'id':'svg1'}, node_labels=labels, style=style3))


A branch can be shared by many adjacent trees, but is stored just once in the tree sequence. For large datasets this is a great saving, because typically each tree-change affects only a few branches at a time, regardless of the tree size. Here’s the take-home message:

Tree sequences are efficient because they don’t store each tree separately

Below is an extension of the plot at the top of this page, showing predicted file sizes when storing not just millions, but billions of human-like genomes: enough to encompass every human on the planet. This demonstrates that the tree sequence encoding leads to savings of many orders of magnitude, even when compared against compressed versions of the standard VCF storage format (original published data here). It’s also worth noting that the efficiency extends to processing time too: tree sequences are often several orders of magnitude faster to process than other storage formats.

x = data1['sample_size']
fig, ax1 = plt.subplots(1, figsize=(10, 4))
ax1.spines["top"].set_visible(False)
ax1.spines["right"].set_visible(False)

plt.loglog(x,  data1['vcf_fit'], c="C1", label="VCF", linewidth=2)
plt.loglog(x,  data1['vcfz_fit'], c="C1", label="compressed VCF", linewidth=2, linestyle=":")

plt.loglog(x, data1['tsk_fit'], c="C0", label="tree sequence", linewidth=2)
plt.loglog(x, data1['tskz_fit'], c="C0", label="compressed tree sequence", linewidth=2, linestyle=":")

plt.xlabel('Number of 100Mb genomes (log scale)', fontsize=12)
plt.ylabel('Space required (GB, log scale)', fontsize=12)
plt.text(16e9, 0.001, 'Size of\nentire\nhuman\npopulation', ha="center", va="bottom", size=14)
plt.annotate('', xy=(16e9, 0.0001), xytext=(16e9, 0.001),
arrowprops=dict(facecolor='black', shrink=0))
plt.legend()
plt.show()


## A record of genetic ancestry¶

Often, we’re not interested so much in the DNA sequence data as the genetic ancestry itself (as discussed in this summary). In other words, the main consideration is the actual trees in a tree sequence, rather than the distributions of mutations placed upon them (indeed in simulations, it may not be necessary to incorporate neutral mutations at all). The trees can be used, for example, to determine the origin and age of alleles under selection, to capture the spatial structure of populations, or to uncover the effects of hybridization and admixture in the past.

Todo

Insert illustration of the above, e.g. use of branch length calculations rather than variants using colours for different branch lengths, or possibly a simple view of a tree sequence over geographical space.

A major benefit of “tree sequence thinking” is the close relationship between the tree sequence and the underlying biological processes that produced the genetic sequences in the first place. For example, each branch point (or “internal node”) in one of the trees above can be imagined as a genome which existed at a specific time in the past, and which is a “most recent common ancestor” (MRCA) of the descendant genomes at that position on the chromosome.

Todo

The tree sequence format gives us access to those ancestral genomes: insert diagram of reconstructed (partial) ancestral haplotypes

We can mark these extra “ancestral genomes” on our picture, although is helpful to distinguish them from the sampled genomes ($$\mathrm{a}$$ to $$\mathrm{j}$$) we have been talking about up to this point. Here we’ll plot the MRCA genomes as circular nodes, rather than the squares we have used previously.

style2 = "#svg2 .node > .sym {visibility: visible;}"  # force-show all nodes: not normally needed
SVG(ts.draw_svg(size=sz, root_svg_attributes={'id':'svg2'}, node_labels=labels, style=style2))


Note

For clarity in these examples, we have been using letters to label nodes, but normally the nodes are referred to by number.

Todo

Mention ARGs in passing and link out to the ARG tutorial.

## A framework for efficient computation¶

Todo

Introduction: algorithms on trees are known to be efficient (phylogenetics). We extend these to multiple correlated trees. Mention “dynamic programming” in passing.

Statistical measures of genetic variation can be thought of as a calculation combining the local trees with the mutations on each branch (or, often preferably, the length of the branches: see this summary). Because a tree sequence is built on a set of small branch changes along the chromosome, statistical calculations can often be updated incrementally as we move along the genome, without having to perform the calculation de novo on each tree. When perfoming calculations on large datasets, this can result in speed-ups of many orders of magnitude, as in this example of calculating Tajima’s D (from this source)

ts_time = np.array([[n,t] for s, n, t in data2[['toolkit','nsam','seconds']] if s == 'tskit'])
ska_time = np.array([[n, t] for s, n, t in data2[['toolkit','nsam','seconds']] if s == 'allel'])
libseq_time = np.array([[n, t] for s, n, t in data2[['toolkit','nsam','seconds']] if s == 'libseq'])
fig, ax1 = plt.subplots(1, figsize=(10, 5))
ax1.spines["top"].set_visible(False)
ax1.spines["right"].set_visible(False)
ax1.loglog(ska_time[:,0], ska_time[:,1], c="C3", linewidth=2, label="scikit-allel library")
ax1.loglog(libseq_time[:,0], libseq_time[:,1], c="C1", linewidth=2, label="libseq library")
ax1.loglog(ts_time[:,0], ts_time[:,1], c="C0", linewidth=2, label="tree sequence method")
ax1.set_ylabel("Time to calculate Tajima's D (secs/site)", fontsize=12)
ax1.set_xlabel("Number of sampled genomes", fontsize=12)
plt.legend()
plt.show()


Todo

Very brief discussion of efficient counting of topologies, i.e. the combinatorics module

Summary of this subsection:

Genetic calculations involve iterating over trees, which is highly efficient in tskit