Tables and editing

Internally, a tree sequence can be thought of as a set of tables, and tskit provides an interface for dealing with these tables directly. This is particularly relevant when you wish to edit or otherwise modify a tree sequence, although tables are also useful for bulk access to data contained in the tree sequence, such as the times of all nodes.

There are eight tables that together define a tree sequence, although some may be empty, and together they form a TableCollection. The tables are defined in the official tskit documentation for Table Definitions, and the Tables API section in the docs describes how to work with them. In this tutorial we give some pointers about what you can and cannot do with them.

Correspondence between tables and trees

Consider the following sequence of trees:

from IPython.display import SVG
ts = tskit.load("data/tables_example.trees")
SVG(ts.draw_svg(y_axis=True))
_images/tables_and_editing_3_0.svg

Ancestral recombination events have produced three different trees that relate the three sampled genomes 0, 1, and 2 to each other along the chromosome of length 100.

Node and edge tables

Each node in each of the above trees represents a particular ancestral genome (a haploid genome; diploid individuals would be represented by two nodes). Details about each node, including the time it lived, are stored in a NodeTable (details here)

ts.tables.nodes
idflagspopulationindividualtimemetadata
01-1-10.00000000b''
11-1-10.00000000b''
21-1-10.00000000b''
30-1-10.15000000b''
40-1-10.60000000b''
50-1-10.80000000b''
60-1-11.00000000b''

Importantly, the first column, id, is not actually recorded, and is only shown when printing out node tables (as here) for convenience. The second column, flags records a 1 for the individuals that are samples, i.e., whose entire genealogical history is recorded by these trees. (Note that the trees above record that node 3 inherited from node 4 on the middle portion of the genome, but not on the ends.)

The way the nodes are related to each other (i.e. the tree topology) is stored in an EdgeTable (details here). Since some edges are present in more than one tree (e.g., node 1 inherits from node 4 across the entire sequence), each edge contains not only the IDs of a parent and a child node, but also the left and right positions which define the genomic region for which it appears in the trees:

ts.tables.edges
idleftrightparentchildmetadata
020.0000000080.0000000030b''
120.0000000080.0000000032b''
20.00000000100.0000000041b''
30.0000000020.0000000042b''
480.00000000100.0000000042b''
520.0000000080.0000000043b''
680.00000000100.0000000050b''
780.00000000100.0000000054b''
80.0000000020.0000000060b''
90.0000000020.0000000064b''

Since node 3 is most recent, the edge that says that nodes 0 and 2 inherit from node 3 on the interval between 0.2 and 0.8 comes first. Next are the edges from node 4: there are four of these, as the edge from node 4 to node 1 is shared across the entire sequence, and for each of the three genomic intervals there is an additional child node. At this point, we know the full tree on the middle interval. Finally, edges specifying the common ancestor of 0 and 4 on the remaining intervals (parents 6 and 5 respectively) allow us to construct all trees across the entire interval.

Site and mutation tables

Most tree sequences have DNA variation data associated with them, stored as mutations overlaid on the trees:

ts = tskit.load("data/tables_example_muts.trees")
mut_labels = {}  # An array of labels for the mutations, listing position & allele change
for mut in ts.mutations():  # This entire loop is just to make pretty labels
    site = ts.site(mut.site)
    older_mut = mut.parent >= 0  # is there an older mutation at the same position?
    prev = ts.mutation(mut.parent).derived_state if older_mut else site.ancestral_state
    mut_labels[mut.id] = "{}{} @{:g}".format(prev, mut.derived_state, site.position)

SVG(ts.draw_svg(y_axis=True, mutation_labels=mut_labels))
_images/tables_and_editing_9_0.svg

There are four mutations in the depiction above, marked by red crosses: one above node 4 on the first tree which records an A to G transition at position 15, another above node 1 in the second tree which records a G to A transition at position 45, and the final two above nodes 2 and 0 recording transitions, both at position 60, on the second tree. The positions are recorded in the SiteTable (details here):

ts.tables.sites
idpositionancestral_statemetadata
015.00000000Ab''
142.00000000Gb''
260.00000000Tb''

As with node tables, the id column is not actually recorded, but is implied by the position in the table. The mutations themselves are recorded in the MutationTable (details here). This associates each mutation with a site ID, the ID of the node above which the mutation occurs, the derived state to which the allele has mutated, and (optionally) a time at which the mutation occured.

ts.tables.mutations
idsitenodetimederived_stateparentmetadata
0040.90000000G-1b''
1110.40000000A-1b''
2230.55000000C-1b''
3200.10000000T2b''

Where there are multiple mutations at the same site, there can be mutations stacked on top of each other. The “parent” column therefore contains the ID of the mutation immediately above the current one at the same site, or -1 if there is no parent.

These sites and mutations allow us to calculate the DNA sequence, or haplotype, for each of the sample nodes:

for sample, h in enumerate(ts.haplotypes()):
    print(f"Sample {sample}: {h}")
Sample 0: AGT
Sample 1: GAT
Sample 2: GGC

Tables overview

Now that we’ve introduced the most important tables that make up a tree sequence, it’s worth seeing how they come together to form a TableCollection. The following visualization shows how they relate to one another, as well as introducing two additional tables: the IndividualTable and PopulationTable (which record which individuals and populations a node is contained in).

The tables in a tree sequence, and their interdependencies

The lines in this diagram below show where one table refers to the IDs of another table, that is, they indicate dependencies between tables.

Note

For simplicity, this diagram omits 2 extra tables which form part of a TableCollection: the MigrationTable (recording details of migrations between populations), and the ProvenanceTable (recording the provenance of the data in a tree sequence)

Metadata

You may have noticed metadata columns in all the tables above. All tables can have (optional) metadata attached to their rows, as described in the Metadata section of the official tskit documentation. In addition, you can assign top-level metadata, associated with the entire TableCollection. The Working with Metadata tutorial provides more information about how to set and use these various forms of metadata.

Accessing table data

To look at a the contents of a table, you can simply print it:

print(ts.tables.mutations)
╔══╤════╤════╤══════════╤═════════════╤══════╤════════╗
║id│site│node│time      │derived_state│parent│metadata║
╠══╪════╪════╪══════════╪═════════════╪══════╪════════╣
║0 │   0│   4│0.90000000│            G│    -1│     b''║
║1 │   1│   1│0.40000000│            A│    -1│     b''║
║2 │   2│   3│0.55000000│            C│    -1│     b''║
║3 │   2│   0│0.10000000│            T│     2│     b''║
╚══╧════╧════╧══════════╧═════════════╧══════╧════════╝

But tskit also provides access to the rows and columns of each table.

Row access

Rows can be accessed using square braces, which will return an object containing the raw values:

row = ts.tables.mutations[1]
print(f"A mutation at site {row.site} exists at time {row.time}")
print(row)
A mutation at site 1 exists at time 0.4
MutationTableRow(site=1, node=1, derived_state='A', parent=-1, metadata=b'', time=0.4)

Additionally, many rows can be extracted into a new table using slices or numpy indexing

ts.tables.mutations[2:4]
idsitenodetimederived_stateparentmetadata
0230.55000000C-1b''
1200.10000000T2b''

Note

When manipulating table data, it is quite possible to create a table collection that does not correspond to a valid tree sequence. For instance, if we replaced the mutations table in our original tree sequence with the table above, the parent column would refer to an invalid mutation ID (ID 2, when in the new tables we only have mutations with IDs 0 and 1). In this particular case there is a method, TableCollection.compute_mutation_parents(), which will recalculate the parent IDs from the trees, but there is no general way to ensure that a manipulated table will remain valid.

Rows can also be added to a table using .add_row(). However, when modifying tables from a tree sequence, you should always modify a copy of the original data, using the TreeSequence.dump_tables() method:

new_tables = ts.dump_tables()
new_pos = 10
new_site_id = new_tables.sites.add_row(position=new_pos, ancestral_state="G")
print("New empty site allocated at position {new_pos} with ID {new_site_id}")
new_tables.sites
New empty site allocated at position {new_pos} with ID {new_site_id}
idpositionancestral_statemetadata
015.00000000Ab''
142.00000000Gb''
260.00000000Tb''
310.00000000Gb''

Note that one of the requirements of a tree sequence is that the sites are listed in order of position, so this new table will require sorting (see Constructing a tree sequence)

Column access

An entire column in a table can be extracted as a numpy array from the table object. For instance, if n is a NodeTable, then n.time will return an array containing the birth times of the individuals whose genomes are represented by the nodes in the table. However, it is important to note that this is a copy of the data, and modifying individual elements of n.time will not change the node table n. To change the column data, you need to take a copy, modify it, and assign it back in. For example, here we add 0.25 to every time except the first in the node table from our current example:

node_table = new_tables.nodes
times = node_table.time
print("Old node times:", times)
times[1:] = times[1:] + 0.25
node_table.time = times
node_table
Old node times: [0.   0.   0.   0.15 0.6  0.8  1.  ]
idflagspopulationindividualtimemetadata
01-1-10.00000000b''
11-1-10.25000000b''
21-1-10.25000000b''
30-1-10.40000000b''
40-1-10.85000000b''
50-1-11.05000000b''
60-1-11.25000000b''

When assigning columns like this, an error will be raised if the column is not of the expected length:

node_table.time = times[2:]
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/tmp/ipykernel_4668/1022932079.py in <module>
----> 1 node_table.time = times[2:]

/opt/hostedtoolcache/Python/3.8.11/x64/lib/python3.8/site-packages/tskit/tables.py in __setattr__(self, name, value)
    331             d = self.asdict()
    332             d[name] = value
--> 333             self.set_columns(**d)
    334         else:
    335             object.__setattr__(self, name, value)

/opt/hostedtoolcache/Python/3.8.11/x64/lib/python3.8/site-packages/tskit/tables.py in set_columns(self, flags, time, population, individual, metadata, metadata_offset, metadata_schema)
    949         """
    950         self._check_required_args(flags=flags, time=time)
--> 951         self.ll_table.set_columns(
    952             dict(
    953                 flags=flags,

ValueError: Input array dimensions must be equal.

Turning tables into a tree sequence

The TableCollection.tree_sequence() method will attempt to turn a table collection into a tree sequence. This is not guaranteed to work: it’s possible you have created a nonsensical tree sequence where, for example, a child node has multiple parents at a given position, or where edges reference non-existant nodes. The Valid tree sequence requirements section of the official tskit docs lists the requirements for a TableCollection to represent a valid TreeSequence.

Sorting

Even if the tables represent a valid tree sequence, for efficiency reasons tree sequences also require several of their constituent tables to be sorted in a specific order. For instance, edges are required to be sorted in nondecreasing order of parent time, and sites are required to be sorted in order of position. We can ensure that a set of tables are correctly sorted by calling the TableCollection.sort() method.

For instance, the new_tables collection we created above no longer has its sites in the correct order, so we need to call new_tables.sort() before turning it into a tree sequence:

# new_ts = new_tables.tree_sequence()  # This won't work
new_tables.sort()
new_ts = new_tables.tree_sequence()  # Now it will

We can now plot the resulting tree sequence:

# Plot without mutation labels, for clarity 
SVG(new_ts.draw_svg(y_axis=True, y_gridlines=True, mutation_labels={}))
_images/tables_and_editing_31_0.svg

You can see that the new tree sequence has been modified as expected: there is a new empty site at position 10 (represented by a tickmark above the X axis with no mutations on it), and all the nodes except node 0 have had their times increased by 0.25.

Constructing a tree sequence

With the tools above in mind, we can now see how to construct a tree sequence by hand. It’s unlikely that you would ever need to do this from scratch, but it’s helpful to understand how tables work. We’ll build the simplest possible tree sequence, a single tree that looks like this:

SVG(tskit.load("data/construction_example.trees").draw_svg(y_axis=True))
_images/tables_and_editing_33_0.svg

Starting with an empty set of tables, we can fill, say, the node information by using NodeTable.add_row() as follows:

tables = tskit.TableCollection(sequence_length=1e3)
node_table = tables.nodes  # set up an alias, for efficiency
node_table.add_row(flags=tskit.NODE_IS_SAMPLE)  # Node 0: defaults to time 0
node_table.add_row(flags=tskit.NODE_IS_SAMPLE)  # Node 1: defaults to time 0
node_table.add_row(time=10)  # Node 2: not a sample
node_table.add_row(flags=tskit.NODE_IS_SAMPLE)  # Node 3
node_table.add_row(time=20)  # Node 4
node_table
idflagspopulationindividualtimemetadata
01-1-10.00000000b''
11-1-10.00000000b''
20-1-110.00000000b''
31-1-10.00000000b''
40-1-120.00000000b''

The .add_row() method is natural (and should be reasonably efficient) if new records appear one-by-one. Alternatively .set_columns() can be used to set columns for all the rows at once, by passing in numpy arrays of the appropriate type. We’ll use that for the edges:

import numpy as np
edge_table = tables.edges
edge_table.set_columns(
    left=np.array([0.0, 0.0, 0.0, 0.0]),
    right=np.array([1e3, 1e3, 1e3, 1e3]),
    parent=np.array([2, 2, 4, 4], dtype=np.int32),  # References IDs in the node table
    child=np.array([0, 1, 2, 3], dtype=np.int32),  # References IDs in the node table
)
edge_table
idleftrightparentchildmetadata
00.000000001000.0000000020b''
10.000000001000.0000000021b''
20.000000001000.0000000042b''
30.000000001000.0000000043b''

And finally we can add a site and a mutation: here we’ll use 0/1 mutations rather than ATGC.

site_id = tables.sites.add_row(position=500.0, ancestral_state='0')
tables.mutations.add_row(site=site_id, node=2, derived_state='1')
ts = tables.tree_sequence()
print("A hand-built tree sequence!")
SVG(ts.draw_svg(y_axis=True))
A hand-built tree sequence!
_images/tables_and_editing_39_1.svg

Note

The ordering requirements for a valid tree sequence do not specify that rows in the node table have to be sorted in any particular order, e.g. by time. By convention, sample nodes are often the first ones listed in the node table, and this is the node order returned by simplify(), but the example above shows that sample nodes need not necessarily be those with the IDs \(0..n\).

Editing tree sequences

Sometimes you may wish to make modifications to a tree sequence that has been generated by a simulation. However, tree sequence objects are immutable and so cannot be edited in-place.

Before describing how to edit a tree sequence, it’s worth noting that tskit may provide a built-in method to achieve what you want to do, without worring about fiddling with the underlying tables by hand. In particular, several methods will return a new tree sequence that has been modified in some way. For example:

However, if you want to do something not covered by those built-in methods, you will need to edit a copy of the underlying tables and then create a new tree sequence from the modified tables. In the following example, we use this approach to remove all singleton sites from a given tree sequence.

def strip_singletons(ts):
    tables = ts.dump_tables()
    tables.sites.clear()
    tables.mutations.clear()
    for tree in ts.trees():
        for site in tree.sites():
            assert len(site.mutations) == 1  # Only supports infinite sites muts.
            mut = site.mutations[0]
            if tree.num_samples(mut.node) > 1:
                site_id = tables.sites.append(site)
                mut = mut.replace(site=site_id)  # set the new site id
                tables.mutations.append(mut)
    return tables.tree_sequence()

This function takes a tree sequence containing some infinite sites mutations as input, and returns a copy in which all singleton sites have been removed. The approach is very simple: we get a copy of the underlying table data in a TableCollection object, and first clear the site and mutation tables. We then consider each site in turn, and if the number of samples with the mutation is greater than one, we add the site and mutation to our output tables using SiteTable.append() and MutationTable.append(). These functions act exactly like SiteTable.add_row() and MutationTable.add_row() but they take an existing site or mutation and add all its details as a new row.

After considering each site, we then create a new tree sequence using the TableCollection.tree_sequence() method on our updated tables. We can test this on a tree sequence that has been mutated using msprime.sim_mutations() with the discrete_genome parameter set to False, which creates infinite sites mutations:

import msprime

ts = msprime.sim_ancestry(10, random_seed=123)
ts = msprime.sim_mutations(ts, rate=10, discrete_genome=False, random_seed=123)
print(ts.num_sites, "sites in the simulated tree sequence")

ts_new = strip_singletons(ts)
print(ts_new.num_sites, "sites after removing singletons")
109 sites in the simulated tree sequence
57 sites after removing singletons

Todo

Add another example here where we use the array oriented API to edit the nodes and edges of a tree sequence. Perhaps recapitating would be a good example?