ARGs as tree sequences#

Note

This tutorial is a work in progress.

At its heart, a tskit tree sequence consists of a list of Nodes, and a list of Edges that connect those nodes. Therefore a succinct tree sequence is equivalent to a mathematical graph, which is additionally annotated with genomic positions such that at each position, a path through the edges exists which defines a tree. This graph interpretation of a tree sequence is tightly connected to the concept of an “ancestral recombination graph” (or ARG).

The term “ARG” is often used to refer to a structure consisting of nodes and edges that describe the genetic genealogy of a set of sampled chromosomes which have evolved via a process of genetic inheritance combined with recombination. ARGs may contain not just nodes corresponding to genetic coalescence, but also additional nodes that correspond e.g. to recombination events (see ARG nodes, below). We call these “full ARGs”, and this tutorial aims to show you how tskit can be used to store and analyse them. As an example, we will generate a full ARG using the process msprime.sim_ancestry() with the record_full_arg=True option, as described in the msprime docs:

import msprime

parameters = {
    "samples": 3, # Three diploid individuals == six sample genomes
    "sequence_length": 1e4,
    "recombination_rate": 1e-7,
    "population_size": 1e3,
    "random_seed": 333,
}

ts_arg = msprime.sim_ancestry(**parameters, record_full_arg=True, discrete_genome=False)
# NB: the strict Hudson ARG needs unique crossover positions (i.e. a continuous genome)

print('"Full ARG" simulated under the Hudson model:')
print(
    f" stored in a tree sequence with {ts_arg.num_nodes} nodes and "
    f" {ts_arg.num_edges} edges which form {ts_arg.num_trees} local trees"
)
"Full ARG" simulated under the Hudson model:
 stored in a tree sequence with 17 nodes and  18 edges which form 3 local trees

Like any tree sequence, we can also add mutations to the ARG to generate genetic variation:

import numpy as np
mu = 1e-7
ts_arg = msprime.sim_mutations(ts_arg, rate=mu, random_seed=888)
print("     Sample node:  " + "   ".join(str(u) for u in ts_arg.samples()))
for v in ts_arg.variants():
    print(f"Variable site {v.site.id}:", np.array(v.alleles)[v.genotypes])
     Sample node:  0   1   2   3   4   5
Variable site 0: ['T' 'C' 'T' 'C' 'C' 'C']
Variable site 1: ['C' 'C' 'C' 'C' 'G' 'C']
Variable site 2: ['T' 'T' 'G' 'T' 'T' 'T']

As well as the standard Visualization of the tree sequence as a set of local trees, we can also plot this tree sequence in network form:

Todo

Incorporate something into tsviz like the draw function from https://github.com/tskit-dev/what-is-an-arg-paper/blob/main/argutils/viz.py, and use that to plot a graph-based viz of this ts, e.g.

import argutils
from matplotlib import pyplot as plt
fig, ax = plt.subplots(1, 1, figsize=(5, 5), sharey=True)
ts2 = argutils.viz.label_nodes(ts_arg)
_ = argutils.draw(ts_arg, ax, draw_edge_widths=True)

Example ARG view

(this PNG file can be removed once the code to autogenerate it is incorporated)

Background semantics#

Two features distinguish the genealogical structure stored in a tskit tree sequence from many other ARG formats. Firstly, the annotations that define which genomic regions are inherited are stored on edges (via the left and a right properties), rather than on the graph nodes, as is sometimes the case. Secondly, the nodes in a tree sequence correspond to genomes, rather that specific events such as coalescence or recombination.

Technically therefore, ARGs stored by tskit are edge-annotated “genome ARGs” (gARGs). This results in a flexible format that can describe ancestral graphs created by many biological processes, including ones that deviate from the neutral coalescent-with-recombination (CwR), for example ancestries incorporating gene conversion, or that have evolved under a Wright-Fisher model of inheritance, in which parents can have more than two children, and coalescence and recombination can occur in the same generation. The focus on genomes rather than events also makes it possible to accurately encode ancestry without having to pin down exactly when the relevant ancestral events took place (TODO: cite our ARG paper).

ARG nodes#

Simplified tree sequences, such those normally produced by msprime, can be though of as a “simplified ARGs” that contain only nodes that correspond to a coalescence somewhere in the genome. They are sufficient to capture the structure of local trees and the correlations between them; this is usually all that is needed for analysis. However they do not contain complete information about the timings and topological operations associated with recombination events. This extra information can be useful for a few specific purposes:

  1. Assuming each recombination happens at a unique position, precise information about which lineages are involved in recombination allows you to work out the exact tree editing, or subtree-prune-and-regraft (SPR) moves required to change one local tree into another as you move along the genome.

  2. Information about recombination and common ancestor events can be used to calculate the likelihood of an full ARG under a specific model of evolution (most commonly, the neutral coalescent with recombination, or CwR, as modelled e.g. by Hudson (1983))

Note, however, that it can be impossible to infer non-coalescent nodes from genetic variation data with any degree of precision.

Unary tree nodes#

To store additional information about non-coalescent nodes, a full ARG stored in tree sequence form contains extra unary nodes (i.e. nodes with only one child). In particular, it can contain recombination nodes which record the timing of recombination events, and non-coalescent-common-ancestor nodes which record cases where lineages share a common ancestor but in which genetic material does not coalesce.

The example we have been using is small, and contains just 2 recombination events (associated with 2 breakpoints). In this instance the only extra nodes happen to be recombination nodes. Msprime only simulates full ARGs in which a recombination event results in a single crossover, and it records this by storing the two genomes immediately prior to gamete formation (the genomes that come together to form a recombinant). In other words, two extra nodes are created for each recombination: one that captures transmission to the left of the crossover and another, at an identical time, to the right. These are identified by the NODE_IS_RE_EVENT flag, and are are highlighed in red below:

# Plot the recombination nodes in red, with a horizontal line at the time of occurrence,
# and only label nodes that are samples or recombination nodes.
samples = set(ts_arg.samples())
re_nodes = set(nd.id for nd in ts_arg.nodes() if nd.flags & msprime.NODE_IS_RE_EVENT)
re_times = [int(nd.time) for nd in ts_arg.nodes() if nd.flags & msprime.NODE_IS_RE_EVENT]
style = ".y-axis .grid {stroke: #ff000033} .mut .sym {stroke: goldenrod}"
for u in re_nodes:
    style += f".n{u} > .sym {{fill: red}}"
ts_arg.draw_svg(
    size=(600, 300),
    y_axis=True,
    y_ticks=re_times,
    y_gridlines=True,
    style=style,
    mutation_labels={},
    node_labels={u: u for u in samples | re_nodes}
)
_images/args_5_0.svg

The location of the recombination nodes imply that the recombination events happened ~588 and ~59 generations ago. The older one, at 588 generations, involved node 13 (to the left of position 2601.01) and node 14 (to the right). As well as narrowing down the recombination event to a specific point in time, the position of these two nodes tells us that the SPR to convert the first into the second tree involves pruning the branch above samples 1, 3, 4, and 5 and regrafting it onto the branch above samples 0 and 2, rather than the other way around. Note that this particular recombination does not change the topology of the tree, but simply the branch lengths.

The recombination event 59 generations ago involved nodes 7 and 8, with the crossover ocurring at position 6516.94. The SPR operation which converts the middle tree into the last one involves pruning the branch above sample node 5 and regrafting it onto the branch above the common ancestor of 1 and 3. In this case, the recombination has led to a change in topology, such that the closest relative of 5 is node 4 from positions 0 to 6516.94, but 1 and 3 from positions 6516.94 to 10,000.

Note

Many ARG representations associate each recombination event with a single node rather than two. It is possible to represent this in tskit, but in such an ARG, the edge annotations do not contain enough information to calculate the standard likelihood under the Hudson model (see Calculating likelihoods).

Todo

Explain in plain language why 2 RE nodes are needed to calculate the likelihood under the Hudson CwR: see e.g. https://github.com/tskit-dev/msprime/issues/1942#issuecomment-1013718650

One suggested way to do this is to show how there is not enough information in a 1-RE-node plot to fully recreate the 2-RE-node equivalent. I think this is because we lose information about the order of breakpoints when multiple breakpoints occur in the same region of hidden material.

Note also that this approach only applies to a model in which a single crossover occurs per chromosome.

Calculating likelihoods#

Because the ARG above was generated under the standard Hudson model (e.g. neutral evolution in a large population with unique recombination breakpoints along a continuous genome), we can calculate its likelihood under that model, for a given recombination rate and population size, using the msprime.log_arg_likelihood() method. Note however, that the simulation was run with the default ploidy level of 2, so that the msprime.sim_ancestry() method assumed the population_size parameter was the diploid population size. The log_arg_likelihood method requires Ne, the haploid population size, which is twice as large, so the likelihood is calculated as follows:

print(
    "Log likelihood of the genealogy under the Hudson model:",
    msprime.log_arg_likelihood(
        ts_arg,
        recombination_rate=parameters["recombination_rate"],
        Ne=parameters["population_size"] * 2  # Number of *haploid* genomes
    )
)
Log likelihood of the genealogy under the Hudson model: -93.57791165409245

It is worth noting that we fully simplify the tree above, we remove all the unary nodes and therefore lose information about the timings of recombination and non-coalescent common ancestry, but we still keep the local trees intact:

ts = ts_arg.simplify()
ts.draw_svg(
    size=(600, 300),
    y_axis=True,
    node_labels={u: u for u in ts.samples()},
    mutation_labels={},
    style=".mut .sym {stroke: goldenrod}",
    y_ticks=[t*500 for t in range(4)]
)
_images/args_9_0.svg

Because of this loss of information, the ARG likelihood cannot be calculated from the simplified tree sequence. We can still, however, calculate the mutation likelihood (i.e. the likelihood of the observed pattern of mutations, given the genealogy) because the topology and branch lengths of the local trees remain unchanged after simplification:

print("Log likelihood of mutations given the genealogy:")
print(' "full" ARG:',  msprime.log_mutation_likelihood(ts_arg, mutation_rate=mu))
print(" simplified:", msprime.log_mutation_likelihood(ts, mutation_rate=mu))
Log likelihood of mutations given the genealogy:
 "full" ARG: -33.204860594400216
 simplified: -33.204860594400216

Recording all nodes is expensive#

Many extra nodes are required to store full information about ancestrally relevant recombination. In fact, as the sequence length increases, these non-coalescent nodes come to dominate the tree sequence (which is one reason they are not included by default). We can calculate the percentage of non-coalescent nodes by comparing a full ARG with its simplified version:

large_sim_parameters = parameters.copy()
large_sim_parameters["sequence_length"] *= 1000
ts_arg = msprime.sim_ancestry(**large_sim_parameters, record_full_arg=True)
ts = ts_arg.simplify()

print(
    "Non-coalescent nodes take up "
    f"{(1-ts.num_nodes/ts_arg.num_nodes) * 100:0.2f}% "
    f"of this {ts.sequence_length/1e6:g} megabase {ts.num_samples}-tip ARG"
)
Non-coalescent nodes take up 99.32% of this 10 megabase 6-tip ARG

This is one of the primary reasons that nodes which are never associated with coalescent events are excluded by default in simulation software such as msprime and SLiM.

Note

As well as ancestrally relevant nodes, the original (mathematical) ARG formulation by Griffiths (1991) includes recombination nodes that are not ancestral to the samples. This leads to a graph with an vastly larger number of nodes than even the ARGs simulated here, and using such structures for simulation or inference is therefore infeasible.

Working with the ARG#

Todo

Add extra content as per https://github.com/tskit-dev/tutorials/issues/43

Other software#

Todo

Show how ARGweaver output can be converted to tskit form.

Todo

Show how KwARG output can be converted to tskit form.

Todo

Implement conversion between the 2 RE node version and the 1 RE node version