What is a tree sequence?

A succinct tree sequence, or “tree sequence” for short, represents the evolutionary relationships between a set of DNA sequences. Tree sequences are based on fundamental biological principles of inheritance, DNA duplication, and recombination; they can be created by simulation or by inferring relationships from empirical DNA data.

Tree sequences provide an efficient way of storing genetic variation data, and can power analyses of millions of whole genomes. Plots (a) and (b) summarize results presented further down this tutorial.

_images/what_is_2_0.svg

As the name suggests, the simplest way to think about a tree sequence is as a sequence of “local trees” — i.e. trees located at different points along the chromosome. Here’s a tiny example based on ten genomes, \(\mathrm{a}\) to \(\mathrm{j}\), spanning a short 1000 letter chromosome.

import string
import tskit
from IPython.display import SVG

mutated_ts = tskit.load("data/whatis_example.trees")
ts = mutated_ts.delete_sites(list(range(mutated_ts.num_sites)))
# Extra code to label and order the tips alphabetically rather than numerically
labels = {i: string.ascii_lowercase[i] for i in range(ts.num_nodes)}
genome_order = [n for n in ts.first().nodes(order="minlex_postorder") if ts.node(n).is_sample()]
labels.update({n: labels[i] for i, n in enumerate(genome_order)})
style1 = (
    ".node:not(.sample) > .sym, .node:not(.sample) > .lab {visibility: hidden;}"
    ".mut {font-size: 12px} .y-axis .tick .lab {font-size: 85%}")
sz = (800, 250)  # size of the plot, slightly larger than the default
ticks = [0, 5000, 10000, 15000, 20000]
SVG(ts.draw_svg(
    size=sz, node_labels=labels, style=style1, y_label="Time ago",
    y_axis=True, y_ticks=ticks))
_images/what_is_4_0.svg

The tickmarks on the X axis and background shading indicate the genomic positions covered by the trees. For just over half the chromosome, from the start until position 580, the relationships between the ten genomes are shown by the first tree. The second tree shows the relationships between positions 580 and 833, and the third from position 833 to the end. We can say that the first tree spans 580 base pairs, the second 253, and the third 167.

Multiple trees are needed because of genetic recombination, which causes different regions of the chromosome to have different histories. Together, the sequence of trees describe the full genetic ancestry, or genetic genealogy, of our 10 genomes.

An efficient encoding of DNA data

A tree sequence can be used to describe patterns of genetic variation by combining the trees with a knowledge of where mutations occur on their branches. Here’s how that might look in our simple example:

mut_labels = {}  # An array of labels for the mutations, listing position & allele change
l = "{:g} ({}{})"
for mut in mutated_ts.mutations():  # This entire loop is just to make pretty labels
    site = mutated_ts.site(mut.site)
    older_mut = mut.parent >= 0  # is there an older mutation at the same position?
    prev = mutated_ts.mutation(mut.parent).derived_state if older_mut else site.ancestral_state
    mut_labels[mut.id] = l.format(site.position, prev, mut.derived_state)

SVG(mutated_ts.draw_svg(
    size=sz, style=style1, node_labels=labels, mutation_labels=mut_labels))
_images/what_is_6_0.svg

There are now ten single nucleotide mutations in the tree sequence. They are shown on the branches of the trees, and the positions of the ten variable sites associated with the mutations are shown along the X axis.

The trees inform us that, for example, the final mutation (at position 995) is inherited by genomes \(\mathrm{b}\) to \(\mathrm{i}\). These genomes must have an A at that position, compared to the original value of G. In other words, once we know the ancestry, placing a relatively small number of mutations is enough to explain all the observed genetic variation. Here’s the resulting “variant matrix”:

haplotypes = ["   ".join(h) for h in mutated_ts.haplotypes()]
print("Position: " + " ".join(str(int(s.position)) for s in mutated_ts.sites()))
print("\n".join(sorted(
    [f"Genome {labels[i]}:  {h}" for i, h in zip(mutated_ts.samples(), haplotypes)])))
Position: 267 283 301 334 393 654 699 721 809 995
Genome a:  A   C   G   T   T   G   G   A   A   G
Genome b:  A   C   G   T   T   G   G   C   T   A
Genome c:  A   C   G   T   T   G   G   C   T   A
Genome d:  A   C   G   T   T   G   G   C   T   A
Genome e:  A   C   G   C   T   G   G   C   T   A
Genome f:  A   C   G   C   T   G   G   C   T   A
Genome g:  A   C   G   T   T   G   G   C   T   A
Genome h:  G   G   T   T   C   G   G   C   T   A
Genome i:  G   G   T   T   C   A   G   C   T   A
Genome j:  G   G   G   T   C   G   T   C   A   G

This approach scales effectively to millions of genomes, and to chromosomes of hundreds of megabases in length. The ability to deal with huge datasets comes down to one key feature of genomic data: adjacent trees along a chromosome are highly correlated, that is, they share structure. In our example this becomes evident if we highlight the branches (“edges” in tree sequence terminology) that remain unchanged between the first and the second tree.

# Highlight certain edges in certain trees. Other visualization possibilities in tutorials/viz.html
kept_edges = [e for e in ts.edges() if e.left==0 and e.right>ts.breakpoints(True)[1]]
style3 = (
    ",".join(f"#svg1 .tree:not(.t2) .node.a{e.parent}.n{e.child} > .edge" for e in kept_edges)
    + "{stroke:#00DD00; stroke-width: 2px}"
    + style1)
SVG(ts.draw_svg(
    size=(500, 250), x_lim=(0, 800), root_svg_attributes={'id':'svg1'},  y_ticks=ticks,
    node_labels=labels, style=style3))
_images/what_is_10_0.svg

A branch can be shared by many adjacent trees, but is stored as a single edge in the tree sequence. For large datasets this is a great saving, because typically each tree-change affects only a few branches at a time, regardless of the tree size.

Below is an extension of the plot at the top of this page, showing predicted file sizes when storing not just millions, but billions of human-like genomes: enough to encompass every human on the planet. This demonstrates that the tree sequence encoding leads to savings of many orders of magnitude, even when compared against compressed versions of the standard VCF storage format (original published data here). It’s also worth noting that the efficiency extends to processing time too: tree sequences are often several orders of magnitude faster to process than other storage formats.

_images/what_is_12_0.svg

A record of genetic ancestry

Often, we’re not interested so much in the DNA sequence data as the genetic ancestry itself (discussed e.g. here). In other words, the main consideration is the actual trees in a tree sequence, rather than the distributions of mutations placed upon them — indeed in genetic simulations, it may not be necessary to incorporate neutral mutations at all. The trees reflect, for example, the origin and age of alleles under selection, the spatial structure of populations, and the effects of hybridization and admixture in the past.

The tree sequence in this tutorial was actually generated using a model of population splits and expansions as shown in the following schematic, plotted using the DemesDraw package. Our 10 genomes were sampled from modern day populations A (a constant-size population) and B (a recently expanding one).

_images/what_is_14_0.svg

A major benefit of “tree sequence thinking” is the close relationship between the tree sequence and the underlying biological processes that produced the genetic sequences in the first place, such as those pictured in the demography above. For example, each branch point (or “internal node”) in one of our trees can be imagined as a genome which existed at a specific time in the past, and which is a “most recent common ancestor” (MRCA) of the descendant genomes at that position on the chromosome. We can mark these extra “ancestral genomes” on our tree diagrams, distinguishing them from the sampled genomes (\(\mathrm{a}\) to \(\mathrm{j}\)) by using circular symbols. We can even colour the nodes by the population that we know (or infer) them to belong to at the time:

colours = {0: "#1f77b4", 1: "#ff7f0e", 2: "#2ca02c"}
style2 = ".y-axis .tick .lab {font-size: 85%}"
style2 += "#svg2 .node > .sym {visibility: visible;}"  # force-show all nodes: not normally needed
style2 += "".join([f".p{n.population} > .sym {{fill: {colours[n.population]}}}" for n in ts.nodes()])

SVG(mutated_ts.draw_svg(
    size=sz, root_svg_attributes={'id':'svg2'}, y_label="Time ago (generations)",
    y_axis=True, y_ticks=ticks, node_labels=labels, mutation_labels={}, style=style2))
_images/what_is_16_0.svg

The diagram shows that most of the ancestral genomes \(\mathrm{k}\) to \(\mathrm{u}\) lived much longer ago than the population split, 1000 generations back, and resided in the ancestral (blue) population. The tree sequence also allows us to easily deduce these MRCA genomes, simply by looking at which mutations they have inherited:

import numpy as np
tables = mutated_ts.dump_tables()
# Flip sample and nonsample flags, making the haplotypes() method print out nonsample nodes
s_flags = tables.nodes.flags[ts.samples()[0]]
no_flags = s_flags-s_flags
tables.nodes.flags = np.where(tables.nodes.flags & tskit.NODE_IS_SAMPLE, no_flags, s_flags)
ts_flipped = tables.tree_sequence()
haplotypes = ["   ".join(h) for h in ts_flipped.haplotypes(missing_data_character=" ")]
print(" " * ts_flipped.num_sites, " " * (ts_flipped.num_sites-4), "")
print(
    "||ANCESTRAL GENOMES||      Position:",
    " ".join(str(int(s.position)) for s in ts_flipped.sites()))
print(
    "\n".join(reversed(sorted([
        f"Genome {labels[i]} (time {ts.node(i).time:7.1f} in the past):  {h}"
        for i, h in zip(ts_flipped.samples(), haplotypes)]))))
                  
||ANCESTRAL GENOMES||      Position: 267 283 301 334 393 654 699 721 809 995
Genome u (time 14237.0 in the past):  G   C   G   T   C   G   G   C   A   G
Genome t (time 13183.7 in the past):  G   G   G   T   C   G   G   C   A   G
Genome s (time  6169.6 in the past):                      G   G   C   T   A
Genome r (time  6036.2 in the past):  A   C   G   T   T                    
Genome q (time  3887.8 in the past):  A   C   G   T   T   G   G   C   T   A
Genome p (time  3311.5 in the past):  A   C   G   T   T   G   G   C   T   A
Genome o (time  3218.8 in the past):                                      A
Genome n (time  3213.6 in the past):  G   G   T   T   C   G   G   C   T    
Genome m (time  3131.7 in the past):  A   C   G   T   T   G   G   C   T   A
Genome l (time   705.4 in the past):  A   C   G   T   T   G   G   C   T   A
Genome k (time   318.8 in the past):  A   C   G   C   T   G   G   C   T   A

You can see that some ancestors are missing genomic regions, because those parts of their genome have not been inherited by any of the sampled genomes. In other words, that ancestral node is not present in the corresponding local tree.

A framework for efficient computation

Using tree structures is a common way to implement efficient computer algorithms, and many phylogenetic methods use the structure provided by the evolutionary tree to implement efficient dynamic programming algorithms. The tree sequence structure allows these approaches to be extended to the particular form of phylogenetic network defined by multiple correlated trees along a genome.

For example, statistical measures of genetic variation can be thought of as a calculation combining the local trees with the mutations on each branch (or, often preferably, the length of the branches: see this summary). Because a tree sequence is built on a set of small branch changes along the chromosome, statistical calculations can often be updated incrementally as we move along the genome, without having to perform the calculation de novo on each tree. Using tree sequences can result in speed-ups of many orders of magnitude when perfoming calculations on large datasets, as in this example of calculating Tajima’s D (from here):

_images/what_is_20_0.svg

The tskit library has extensive support for these sorts of population genetic calculations. It provides efficient methods for traversing through large trees and tree sequences, as well as providing other phylogenetically relevant methods such as parsimonious placement of mutations, and the counting of topologies embedded within larger trees.

If you are new to tree sequences, and want to start finding out about tskit, you might now want to continue to the next tutorial: Terminology & concepts.

Further reading