Tskit for phylogenetics#

Tskit, the tree sequence toolkit, is an efficient library for very large evolutionary trees. It makes trees with millions of tips straightforward to store and analyse, as in the example below:

Hide code cell source

import tskit
import numpy as np

def collapse_tree_for_plot(tree, max_tips=20, style=None):
    """
    Return a condensed version of a single tree by collapsing large clades until
    at most `max_tips` labelled tips remain.

    The returned tuple is:
        (plot_tree, node_labels, style)

    where `plot_tree` is a tree suitable for plotting with `draw_svg`, `node_labels`
    labels the retained sample tips plus the collapsed clades, and `style` highlights
    the collapsed clades.

    This helper is intended for plotting a single large tree. It requires the tree to
    come from a tree sequence containing exactly one tree.
    """
    ts = tree.tree_sequence
    if ts.num_trees != 1:
        raise ValueError("collapse_tree_for_plot currently expects a single-tree tree sequence")
    if tree.num_roots != 1:
        raise ValueError("collapse_tree_for_plot currently expects a single-root tree")
    if max_tips < 2:
        raise ValueError("max_tips must be at least 2")

    postorder = list(tree.nodes(order="postorder"))
    counts = np.zeros(ts.num_nodes, dtype=int)
    for u in postorder:
        if tree.is_sample(u):
            counts[u] = 1
        else:
            counts[u] = sum(counts[v] for v in tree.children(u))

    displayed_tips = tree.num_samples()
    candidate_nodes = [
        u for u in postorder
        if counts[u] > 1 and tree.parent(u) != tskit.NULL
    ]
    collapsed_nodes = set()
    forbidden = set()

    while displayed_tips > max_tips:
        needed_reduction = displayed_tips - max_tips
        candidates = [u for u in candidate_nodes if u not in collapsed_nodes and u not in forbidden]
        if len(candidates) == 0:
            break

        # Prefer a clade that gets us close to the target without overshooting.
        acceptable = [u for u in candidates if counts[u] - 1 <= needed_reduction]
        if len(acceptable) > 0:
            u = max(acceptable, key=lambda x: counts[x])
        else:
            u = min(candidates, key=lambda x: counts[x])

        collapsed_nodes.add(u)
        displayed_tips -= counts[u] - 1

        # Do not collapse descendants or ancestors of an already-collapsed node.
        stack = list(tree.children(u))
        while len(stack) > 0:
            v = stack.pop()
            forbidden.add(v)
            stack.extend(tree.children(v))
        v = tree.parent(u)
        while v != tskit.NULL:
            forbidden.add(v)
            v = tree.parent(v)

    keep_nodes = []
    stack = list(tree.roots)
    while len(stack) > 0:
        u = stack.pop()
        if u in collapsed_nodes:
            keep_nodes.append(u)
        elif tree.is_sample(u):
            keep_nodes.append(u)
        else:
            stack.extend(tree.children(u))

    plot_ts = ts.simplify(sorted(keep_nodes), filter_nodes=False)
    plot_tree = plot_ts.first()

    node_labels = {}
    style = [style or ""]
    for u in plot_tree.nodes():
        if u in collapsed_nodes:
            node_labels[u] = f"{counts[u]:,} tips"
            scale = min(8.0, 1.0 + counts[u] / max_tips)
            style.append(
                f".n{u} > .sym {{clip-path: polygon(50% 0%, 100% 100%, 0% 100%); "
                f"transform: scale({scale}, 3.5)}}"
            )
            style.append(
                f".n{u} > .lab {{transform: translateY(20px); font-size: 12px}}"
            )
        elif plot_tree.is_sample(u):
            node_labels[u] = str(u)

    return plot_tree, node_labels, "".join(style)
%%time

num_tips = 1_000_000
big_tree = tskit.Tree.generate_comb(num_tips)
print("Tree sequence takes up", big_tree.tree_sequence.nbytes / 1024**2, "Mb")
print(f"Generating a 'comb' (pectinate) tree of {num_tips} tips took:")
Tree sequence takes up 129.7001838684082 Mb
Generating a 'comb' (pectinate) tree of 1000000 tips took:
CPU times: user 3.33 s, sys: 70.1 ms, total: 3.4 s
Wall time: 3.4 s

Plotting a tree with a million tips is impractical, but we can draw a summary (see the visualization tutorial for details):

plot_tree, labels, css = collapse_tree_for_plot(  # function from the top of this notebook
    big_tree, max_tips=20, style=".y-axis .tick .lab {font-size: 8pt}"
)
plot_tree.draw_svg(
    size=(1000, 400), node_labels=labels, style=css, time_scale="rank", y_axis=True
)
_images/90cc536b8bf5cbf3bf3ead400add09c773c921572465e889397a39679d215ef3.svg

Calculations on these large trees can be very efficient:

%%time

traversed_nodes = 0
for u in big_tree.nodes(order="postorder"):
    traversed_nodes += 1
print(f"Postorder traversal through {traversed_nodes} nodes took:")
Postorder traversal through 1999999 nodes took:
CPU times: user 263 ms, sys: 35 μs, total: 263 ms
Wall time: 262 ms
%%time

b1_index = big_tree.b1_index()
print(f"B1 balance index is {b1_index}. Calculation took:")
B1 balance index is 14.39272472286399. Calculation took:
CPU times: user 10.1 ms, sys: 0 ns, total: 10.1 ms
Wall time: 10.1 ms

We can also read trees efficiently, e.g. in Newick format:

Todo

Add example of fast reading of a newick file, once https://github.com/tskit-dev/tskit/issues/2187 is solved. E.g. we could say

For example, we can read in a newick format tree of XXX tips in XXX secs. In practice, this means when reading in a large tree you are mainly limited by disk access speeds

import tsconvert  # used for reading tree sequences from different formats

# example code reading in a large file, timed

# Read a smaller tree from a string (here we create a tree spanning 1000 genomic units)
ts = tsconvert.from_newick("(A:6,((B:1,C:1):2,(D:2,E:2):1):3);", span=1000)

The “succinct tree sequence” format used by tskit can also store mutations (and optionally a reference genome) along with the tree(s). This results in a single unified representation of large genomic datasets, storing trees, sequence data, and metadata in a single efficient structure. See Storing and accessing genetic data for examples.

As the name suggests, a tree sequence can also store and analyse a sequence of trees along a genome (i.e. a “phylogenetic network”). This is necessary to account for recombination between lineages, and may be important even when looking at species-level phylogenies due to the effects of hybridization and incomplete lineage sorting. An overview and links to further details are given at the end of this page.

Hints for phylogeneticists#

Unlike other phylogenetic libraries, tskit is designed to efficiently store not just single trees, but also sequences of correlated trees along a genome. This means that the library has some features not found in more standard phylogenetic libraries. Here we focus on the Python API, introducing seven tskit concepts that may be useful to those with a background in phylogenetics (each is linked to a separate section below):

  1. An evolutionary tree is always contained within a “tree sequence”. See Trees are always part of a tree sequence

  2. The basic elements of a tree are nodes and edges, referred to by integer IDs. See Integer node and edge IDs

  3. The Tree object in the Python API provides useful phylogenetic methods. See The Tree object

  4. Specific nodes in a tree (often the tips) are marked as “samples”, meaning they are known from data. See Sample nodes

  5. Nodes and edges have additional attributes, with arbitrary information stored in metadata. See Attributes and metadata

  6. All nodes must have a valid time, meaning trees are always directional (i.e. “rooted”). See Nodes must have times

  7. “Roots” in trees have a specific definition, and a single tree can consist of topologically independent clades (a “multiroot” tree). See Roots and multiroot trees

Trees are always part of a tree sequence#

In tskit, all trees are stored as a “tree sequence” of correlated trees. This extends naturally to multiple trees, such as those produced by hybridization. In the simplest case, however, the tree sequence can contain just a single tree. This tree can be obtained using the first() method.

Hide code cell source

# Make sensible node labels. It would be nice to roll this into tree.draw()
node_labels = {
    node_object.id: node_object.metadata["name"]
    for node_object in ts.nodes()
    if "name" in node_object.metadata
}
tree = ts.first()
tree.draw(node_labels=node_labels)  # or use draw_svg() for more options
_images/ce07e2be35f389568e77eb66b4527d9a52f8dfd45d44bc374d781940d937c484.svg

Often you will have the tree sequence stored in a variable, such as the ts variable used above. However, you can also obtain the tree sequence in which a tree is contained using the Tree.tree_sequence attribute:

tree.tree_sequence  # When output in a notebook, prints a summary of the tree sequence
Tree Sequence
Trees1
Sequence Length1 000
Time Unitsunknown
Sample Nodes5
Total Size942 Bytes
MetadataNo Metadata
Table Rows Size Has Metadata
Edges 8 264 Bytes
Individuals 0 24 Bytes
Migrations 0 8 Bytes
Mutations 0 16 Bytes
Nodes 9 511 Bytes
Populations 0 8 Bytes
Provenances 0 16 Bytes
Sites 0 16 Bytes
Provenance Timestamp Software Name Version Command Full record
To cite this software, please consult the citation manual: https://tskit.dev/citation/

Integer node and edge IDs#

The plot above labels nodes by name, but internally the tskit library relies heavily on integer IDs. Here’s the same tree with node IDs plotted instead:

tree = ts.first()
tree.draw_svg()
_images/ac3c47964cde2c47059dbbe79224bb343e50d73fb7bbb667c5053f58633ab057.svg

Nodes#

Each node in a tree sequence is allocated an integer ID from 0 to ts.num_nodes - 1 (IDs can be allocated in any order; often the tips are labelled starting from 0, but this is not guaranteed, and is not the case in the example above).

For efficiency reasons, tree traversal routines, as well as many other tskit methods, tend to return integer IDs. You can use these IDs to get specific information about the node and its position in the tree, for example

node_id = 4
parent_id = tree.parent(node_id)
child_ids = tree.children(node_id)
print("The parent of", node_id, "is", parent_id, "and its children are", child_ids)
# or get all parents as an array (where -1 means there is no parent)
print(f"The parents of nodes 0..{ts.num_nodes-1} are", tree.parent_array)
The parent of 4 is 2 and its children are (7, 8)
The parents of nodes 0..8 are [-1  0  0  2  2  3  3  4  4 -1]

Other methods also exist to examine nodes in a tree, e.g. Tree.is_leaf(), Tree.mrca() for the most recent common ancestor between two or more nodes, etc.

Edges#

Rather than refer to “branches” of a tree, tskit tends to refer to Edges (the term “edge” emphasises that these can span Multiple trees, although for tree sequences containing a single tree, the terms are interchangeable). Like other entities in tskit, edges are referred to by an integer ID. For instance, here is the edge above the internal node 4:

node_id = 4
edge_id = tree.edge(node_id)
print("The edge above", node_id, "has ID", edge_id)
print(tree.tree_sequence.edge(edge_id))
The edge above 4 has ID 5
Edge(left=0.0, right=1000.0, parent=2, child=4, metadata=b'', id=5)

The left and right attributes of an edge give genomic coordinates, and are important in tree sequences that contain more than one tree.

The Tree object#

The Tree object has methods to perform basic operations on a tree, such as traversing nodes and identifying parents, children, and common ancestors. Several methods also return numpy arrays for use in efficient algorithms using numba.

for n_id in tree.nodes(order="postorder"):
    # you can also use "preorder", "levelorder", "timeasc", etc.
    print(n_id)
# Or get all of them as arrays
print("Node IDs in postorder:", tree.postorder())
1
5
6
3
7
8
4
2
0
Node IDs in postorder: [1 5 6 3 7 8 4 2 0]

Various phylogenetic statistics are also available on trees, for example:

print(f"The Colless imbalance index is {tree.colless_index()}")
The Colless imbalance index is 3

See Phylogenetic methods for more examples.

Sample nodes#

Often, we only have detailed information about specific nodes that we have sampled, such as genomes A, B, C, D, and E in the example above. These are designated as sample nodes, and are plotted as square nodes. The concept of sample nodes is integral to the tskit format. They can be identified by using the Node.is_sample() and Tree.is_sample() methods, or can be listed using TreeSequence.samples() or Tree.samples() (internally, the node.flags field is used to record which nodes are samples):

for n_id in tree.nodes():
    n_is_sample = tree.is_sample(n_id)
    print(f"Node {n_id} {'is' if n_is_sample else 'is not'} a sample node")

print("Sample nodes are", tree.tree_sequence.samples())
Node 0 is not a sample node
Node 1 is a sample node
Node 2 is not a sample node
Node 3 is not a sample node
Node 5 is a sample node
Node 6 is a sample node
Node 4 is not a sample node
Node 7 is a sample node
Node 8 is a sample node
Sample nodes are [1 5 6 7 8]

Often the sample nodes are the leaves of a tree, but this need not be the case. There are fast methods for identifying the sample nodes below an internal node in the tree, etc.

Attributes and metadata#

Given a node ID, you can access more information about the node from a node object. Because nodes are shared across all trees in a tree sequence, you access the node object via the tree_sequence to which this tree belongs:

tree.tree_sequence.node(node_id)  # or simply ts.node(node_id)
Node(id=4, flags=0, time=2.0, population=-1, individual=-1, metadata={})

Attributes such as id, flags, and time are always present. Arbitrary information, such as a name or bootstrap values, is stored in metadata.

for n_id in tree.nodes():
    print("Node", n_id, tree.tree_sequence.node(n_id).metadata.get("name", "<no name>"))
Node 0 <no name>
Node 1 A
Node 2 <no name>
Node 3 <no name>
Node 5 B
Node 6 C
Node 4 <no name>
Node 7 D
Node 8 E

However, for large datasets, it may be more efficient to access the time array for all nodes, which provides direct memory access into the tables that underlie the tree sequence format:

tree.tree_sequence.nodes_time
array([6., 0., 3., 1., 2., 0., 0., 0., 0.])

Nodes must have times#

Perhaps the most noticeable difference between a tskit tree and the encoding of trees in other phylogenetic libraries is that tskit does not explicitly store branch lengths. Instead, each node has a time associated with it. Branch lengths can therefore be found by calculating the difference between the time of a node and the time of its parent node.

Since nodes must have a time, tskit trees always have implicit branch lengths. To represent a tree (“cladogram”) in which the branch lengths are not meaningful, the TreeSequence.time_units of a tree sequence can be specified as "uncalibrated" (see below).

Another implication of storing node times rather than branch lengths is that tskit trees are always directional (i.e. they are “rooted”). The reason that tskit stores node times (rather than e.g. genetic distances between them) is to ensure temporal consistency. In particular, it makes it impossible for one node to be an ancestor of another node in one tree, and a descendant of that same node in another tree in the tree sequence. This is of critical importance when extending the concept of genetic ancestry to Multiple trees along a genome.

The TreeSequence.time_units attribute stores the units in which time is measured: if not known, this defaults to “unknown”:

print("Time units are", tree.tree_sequence.time_units)
tree.draw_svg(y_axis=True)
Time units are unknown
_images/ddf2ef573d96e439d0a1a6c1199bb5fc37f2f89000ab5036380205be570962f4.svg

Although branch lengths are not stored explicitly, for convenience tskit provides a Tree.branch_length() method:

print(
    "The edge (i.e. branch) immediately above node",
    node_id,
    "has a 'length' of",
    tree.branch_length(node_id),
)
The edge (i.e. branch) immediately above node 4 has a 'length' of 1.0

The branch distance between two samples is also easy to calculate using distance_between():

target_node_1 = 5
target_node_2 = 7
print(
    "The branch distance between nodes",
    target_node_1,
    "and",
    target_node_2,
    "is",
    tree.distance_between(target_node_1, target_node_2),  # beware: tree.path_length counts edges
)
The branch distance between nodes 5 and 7 is 6.0

It is worth noting that this distance is the basis for the “genetic divergence” between two samples in a tree. For this reason, an equivalent way to carry out the calculation is to use TreeSequence.divergence(), part of the standard tskit Statistics framework, setting mode="branch" and windows="trees". This is a more flexible approach, as it allows the distance between multiple sets of samples in Multiple trees to be calculated efficiently:

target_node_1 = 5
target_node_2 = 7
print(
    "Branch distance using built-in stats framework:",
    tree.tree_sequence.divergence(([5], [7]), mode="branch", windows="trees")
)
Branch distance using built-in stats framework: [6.]

Roots and multiroot trees#

In tskit, tree roots are defined with respect to the sample nodes. In particular, if we move back in time along the tree branches from a sample, the oldest node that we encounter is defined as a root. The ID of a root can be obtained using Tree.root:

print("The root node of the following tree has ID", tree.root)
tree.draw_svg()
The root node of the following tree has ID 0
_images/ac3c47964cde2c47059dbbe79224bb343e50d73fb7bbb667c5053f58633ab057.svg

We can also create a single “tree” consisting of multiple unlinked clades. In our example, we can create one of these phylogenetically unusual objects by removing the edge above node 4 via editing the underlying tables:

# Trees & tree seqs are immutable: to change them, modify a copy of the underlying tables
tables = ts.dump_tables()
keep_edge = tables.edges.child != 4
tables.edges.replace_with(tables.edges[keep_edge])
new_ts = tables.tree_sequence()  # Turn the edited tables back into a tree sequence
new_tree = new_ts.first()
new_tree.draw_svg()
_images/25a7d82a4ad7075dc0e62037fecd3ac56e1e8f7e1cfb1edb540e24ccad1cdc4c.svg

Although there are two separate topologies in this plot, in tskit terminology this is considered a single tree with two roots:

print("The first tree has", len(new_tree.roots), "roots:", new_tree.roots)
The first tree has 2 roots: [4, 0]

This also means that if we have no topology at all (i.e. an “empty tree”), each sample is its own root:

tables.edges.clear()
erased_ts = tables.tree_sequence()
empty_tree = erased_ts.first()
print("This empty tree has", len(empty_tree.roots), "roots:", empty_tree.roots)
empty_tree.draw_svg()
This empty tree has 5 roots: [1, 5, 6, 7, 8]
_images/ca2866442da84ae2ac104a2ae54db73d1824cb47c23a7274da3afa4303160486.svg

The samples here are isolated nodes. This may seem like a strange corner case, but in tskit, isolated sample nodes are used to represent Missing data. The empty tree therefore represents a case in which relationships between the samples are not known. This could apply, for instance, in regions of the genome where no genetic data exists, or where genetic ancestry has not been simulated.

Phylogenetic methods#

Todo

Demo some phylogenetic methods. e.g.

  1. Total branch length - demo quick calculation across multiple trees - incremental algorithm used extensively in population genetics. (“bringing tree thinking to popgen”).

  2. KC distance

  3. Balance metrics

  4. Topology rankings (see https://github.com/tskit-dev/tutorials/issues/93)

Storing and accessing genetic data#

Tskit is designed to capture both evolutionary tree topologies and the genetic sequences that evolve along the branches of these trees. It does this by defining mutations and sites, which are associated with specific positions along the genome.

import msprime  # The `msprime` package can add mutations to a tree sequence
mutated_ts = msprime.sim_mutations(ts, rate=3e-3, random_seed=321)
mutated_tree = mutated_ts.first()
print("Variable sites with the following IDs were generated:")
for site in mutated_tree.sites():
    print(
        f"Site ID {site.id} @ genomic position {site.position:g}:",
        f"{site.ancestral_state} -> {site.mutations[0].derived_state}"
    )
mutated_tree.draw_svg()
Variable sites with the following IDs were generated:
Site ID 0 @ genomic position 47: G -> C
Site ID 1 @ genomic position 77: C -> G
Site ID 2 @ genomic position 87: G -> C
Site ID 3 @ genomic position 91: T -> A
Site ID 4 @ genomic position 115: G -> A
Site ID 5 @ genomic position 116: G -> A
Site ID 6 @ genomic position 117: T -> A
Site ID 7 @ genomic position 126: C -> A
Site ID 8 @ genomic position 130: G -> T
Site ID 9 @ genomic position 141: G -> T
Site ID 10 @ genomic position 143: T -> A
Site ID 11 @ genomic position 203: A -> G
Site ID 12 @ genomic position 230: A -> G
Site ID 13 @ genomic position 247: G -> A
Site ID 14 @ genomic position 272: A -> G
Site ID 15 @ genomic position 317: G -> A
Site ID 16 @ genomic position 335: C -> G
Site ID 17 @ genomic position 355: G -> C
Site ID 18 @ genomic position 364: T -> G
Site ID 19 @ genomic position 382: G -> T
Site ID 20 @ genomic position 426: T -> A
Site ID 21 @ genomic position 430: C -> T
Site ID 22 @ genomic position 433: A -> G
Site ID 23 @ genomic position 489: C -> A
Site ID 24 @ genomic position 516: C -> T
Site ID 25 @ genomic position 523: G -> A
Site ID 26 @ genomic position 540: C -> G
Site ID 27 @ genomic position 623: A -> T
Site ID 28 @ genomic position 671: C -> A
Site ID 29 @ genomic position 711: G -> T
Site ID 30 @ genomic position 737: A -> C
Site ID 31 @ genomic position 742: T -> G
Site ID 32 @ genomic position 752: G -> A
Site ID 33 @ genomic position 800: G -> A
Site ID 34 @ genomic position 803: T -> C
Site ID 35 @ genomic position 819: A -> G
Site ID 36 @ genomic position 842: C -> G
Site ID 37 @ genomic position 849: A -> G
Site ID 38 @ genomic position 861: A -> T
Site ID 39 @ genomic position 913: C -> A
Site ID 40 @ genomic position 922: G -> A
Site ID 41 @ genomic position 965: G -> A
Site ID 42 @ genomic position 975: A -> G
Site ID 43 @ genomic position 979: G -> A
Site ID 44 @ genomic position 988: A -> C
Site ID 45 @ genomic position 997: T -> G
_images/67923050d3893d016e1dd3001311e1508af158f927c9e901d5bdfa63048296ff.svg

Mutations occur above nodes in a tree, and all descendant nodes inherit that specific mutation (unless it is replaced by a subsequent mutation at the same site). This allows genetic variation to be efficiently represented using the tree topology. To obtain the genetic variation at each site across the entire genome, you can use the TreeSequence.sites() method, or (less efficiently), you can use TreeSequence.alignments() to output the entire sequences for each sample node:

for node_id, alignment in zip(
    mutated_ts.samples(),
    mutated_ts.alignments(missing_data_character="."),
):
    print(f"Node {node_id}: {alignment}")
Node 1: ...............................................C.............................C.........C...T.......................GGT........C...G..........T.T...........................................................G..........................G................G........................A............................................G.................C...................A........T.................G...........................................T...T..G.......................................................C..........................T......G................C..................................................................................T...............................................C.......................................G.........................A....A.........G...............................................A..T...............A......................C......C...........A...................................................C........A..........................................G.........A...G........A........G..
Node 5: ...............................................G.............................C.........G...A.......................GGT........C...T..........G.T...........................................................A..........................A................A........................A............................................G.................C...................G........T.................G...........................................T...C..A.......................................................C..........................C......G................C..................................................................................A...............................................A.......................................T.........................C....T.........A...............................................G..T...............G......................C......G...........A...................................................C........G..........................................G.........G...A........A........T..
Node 6: ...............................................G.............................C.........G...A.......................GGA........C...T..........G.T...........................................................A..........................A................G........................A............................................A.................G...................G........T.................G...........................................T...C..A.......................................................C..........................C......G................G..................................................................................A...............................................A.......................................T.........................C....T.........G...............................................G..T...............G......................C......G...........A...................................................A........G..........................................A.........G...G........A........T..
Node 7: ...............................................G.............................G.........G...A.......................GAT........C...T..........G.A...........................................................A..........................A................G........................A............................................G.................C...................G........G.................T...........................................A...C..A.......................................................A..........................C......G................C..................................................................................A...............................................C.......................................T.........................A....T.........G...............................................G..T...............G......................C......G...........T...................................................C........G..........................................G.........A...G........C........T..
Node 8: ...............................................G.............................G.........G...A.......................AAT........A...T..........G.A...........................................................A..........................A................G........................G............................................G.................C...................G........T.................T...........................................T...C..A.......................................................C..........................C......A................C..................................................................................A...............................................C.......................................T.........................A....G.........G...............................................G..C...............G......................G......G...........A...................................................C........G..........................................G.........A...G........A........T..

Multiple trees#

Tskit is particularly useful when the ancestry of your dataset cannot be adequately represented by a single tree. This is a pervasive issue in genomes (even from different species) that have undergone recombination in the past. The resulting series of local trees along a genome are highly correlated (see Concepts).

Instead of storing each tree along a genome separately, tskit records the genomic coordinates of each edge, which leads to enormous efficiencies in storage and analysis. As a basic demonstration, we can repeat the edge removal example above, but only remove the ancestral link above node 4 for the first half of the genome.

tables = ts.dump_tables()
edge_id_above_node_4 = ts.first().edge(4)
left_coord_for_edges = tables.edges.left
left_coord_for_edges[edge_id_above_node_4] = 50
tables.edges.left = left_coord_for_edges  # apply the modified left coordinates
tables.sort()
multi_ts = tables.tree_sequence()

multi_ts.draw_svg()
_images/bdcc4912a3d7a23a6b171a9255efa437f2f6dcccaeeaf5e39691e91e36234e24.svg

For the left-hand side of the genome, we lack information about the ancestry of node 4, but for the right-hand side, we know this information. This generates two trees in the tree sequence, which differ only in the presence or absence of a single branch. We do not have to store the entire tree on the right separately: all the edges that are shared between trees are stored only once.

The rest of the tskit tutorials will lead you through the concepts involved with storing and analysing sequences of many correlated trees. For a simple introduction, you might want to start with What is a tree sequence?.