Tables and editing#

Internally, a tree sequence can be thought of as a set of tables, and tskit provides an interface for dealing with these tables directly. This is particularly relevant when you wish to edit or otherwise modify a tree sequence, although tables are also useful for bulk access to data contained in the tree sequence, such as the times of all nodes.

There are eight tables that together define a tree sequence, although some may be empty, and together they form a TableCollection. The tables are defined in the official tskit documentation for Table Definitions, and the Tables API section in the docs describes how to work with them. In this tutorial we give some pointers about what you can and cannot do with them.

Correspondence between tables and trees#

Consider the following sequence of trees:

ts = tskit.load("data/tables_example.trees")
Hide code cell source
ts.draw_svg(y_axis=True)
_images/0327585c23c21d289094cc6394cc71ecc7e43f14197d7961b9d759c5abcc0e29.svg

Ancestral recombination events have produced three different trees that relate the three sampled genomes 0, 1, and 2 to each other along the chromosome of length 100.

Node and edge tables#

Each node in each of the above trees represents a particular ancestral genome (a haploid genome; diploid individuals would be represented by two nodes). Details about each node, including the time it lived, are stored in a NodeTable (details here)

ts.tables.nodes
idflagspopulationindividualtimemetadata
01000.00000000
11000.00000000
21010.00000000
30020.15000000
40-1-10.60000000
50-1-10.80000000
60-1-11.00000000

Importantly, the first column, id, is not actually recorded, and is only shown when printing out node tables (as here) for convenience. The second column, flags records a 1 for the nodes that are samples, i.e., whose entire genealogical history is recorded by these trees. You can see that the sample nodes are present in all the trees, but other nodes need not be: for example node 3 is only seen in the middle tree. The individual and population columns store an index into the rows of the Individual and population tables, or tskit.NULL (-1) if the node is not associated with an individual or population. The time column records the historical age of the node, with contemporary nodes having a time of 0.

The way the nodes are related to each other (i.e. the tree topology) is stored in an EdgeTable (details here). Since some edges are present in more than one tree (e.g., node 1 inherits from node 4 across the entire sequence), each edge contains not only the IDs of a parent and a child node, but also the left and right positions which define the genomic region for which it appears in the trees:

ts.tables.edges
idleftrightparentchildmetadata
0208030
1208032
2010041
302042
48010042
5208043
68010050
78010054
802060
902064

Since node 3 is most recent, the edge that says that nodes 0 and 2 inherit from node 3 on the interval between 0.2 and 0.8 comes first. Next are the edges from node 4: there are four of these, as the edge from node 4 to node 1 is shared across the entire sequence, and for each of the three genomic intervals there is an additional child node. At this point, we know the full tree on the middle interval. Finally, edges specifying the common ancestor of 0 and 4 on the remaining intervals (parents 6 and 5 respectively) allow us to construct all trees across the entire interval.

Site and mutation tables#

Most tree sequences have DNA variation data associated with them, stored as mutations overlaid on the trees:

ts = tskit.load("data/tables_example_muts.trees")
Hide code cell source
mut_labels = {}  # An array of labels for the mutations, listing position & allele change
for mut in ts.mutations():  # This entire loop is just to make pretty labels
    site = ts.site(mut.site)
    older_mut = mut.parent >= 0  # is there an older mutation at the same position?
    prev = ts.mutation(mut.parent).derived_state if older_mut else site.ancestral_state
    mut_labels[mut.id] = "{}{} @{:g}".format(prev, mut.derived_state, site.position)
ts.draw_svg(y_axis=True, mutation_labels=mut_labels)
_images/4a9eee4906ad34d1d4026f2fdc0fc571674bee2f5912f29153c28221be7ed3a3.svg

There are four mutations in the depiction above, marked by red crosses: one above node 4 on the first tree which records an A to G transition at position 15, another above node 1 in the second tree which records a G to A transition at position 42, and the final two above nodes 0 and 3 recording transitions, both at position 60, on the second tree. The positions are recorded in the SiteTable (details here):

ts.tables.sites
idpositionancestral_statemetadata
015A
142G
260T

As with node tables, the id column is not actually recorded, but is implied by the position in the table. The mutations themselves are recorded in the MutationTable (details here). This associates each mutation with a site ID, the ID of the node above which the mutation occurs, the derived state to which the allele has mutated, and (optionally) a time at which the mutation occured.

ts.tables.mutations
idsitenodetimederived_stateparentmetadata
0040.90000000G-1
1110.40000000A-1
2230.55000000C-1
3200.10000000T2

Where there are multiple mutations at the same site, there can be mutations stacked on top of each other. The “parent” column therefore contains the ID of the mutation immediately above the current one at the same site, or -1 if there is no parent.

These sites and mutations allow us to calculate the DNA sequence, or haplotype, for each of the sample nodes:

for sample, h in enumerate(ts.haplotypes()):
    print(f"Sample {sample}: {h}")
Sample 0: AGT
Sample 1: GAT
Sample 2: GGC

Tables overview#

Now that we’ve introduced the most important tables that make up a tree sequence, it’s worth seeing how they come together to form a TableCollection.

The following visualization shows how they relate to one another, as well as introducing two additional tables: the IndividualTable and PopulationTable (which record which individuals and populations a node is contained in).

The tables in a tree sequence, and their interdependencies

The lines in this diagram below show where one table refers to the IDs of another table, that is, they indicate dependencies between tables.

Individual and population tables#

Previously we noted that a row in the node table could contain an index into the IndividualTable and the PopulationTable. In fact, in our example tree sequence, the node table above indicates that nodes 0 to 3 belong to the population with ID 0 in the population table. Here’s what that table looks like in this tree sequence:

ts.tables.populations
idmetadata
0{'name': 'GBR', 'description': 'British in England and Scotland'}

The data in this table, and indeed in the individual table, is not visible in the default tree sequence visualization, but can be very useful for keeping track of information about the identity and location of nodes in the tree sequence. In particular, you can see that the populations in this tree sequence have associated metadata, a topic we will discuss in the next section.

The node table also indicated that two nodes (node 0 and 1) point to the individual with ID 0, one node (node 2) points to individual 1, and another node points to individual 2. Here’s the relevant individual table:

ts.tables.individuals
idflagslocationparentsmetadata
002, -1{'name': 'Alice'}
102, -1{'name': 'Bob'}
20-1, -1{'name': 'Carol'}

Since individual 0 (Alice) is referenced by two nodes (genomes), and assuming that this is a diploid species, this indicates that we have sampled both maternal and paternal genomes for this individual. The other two indivividuals, Bob and Carol, are referenced by only one node, so either we only managed to sample one of their two genomes, or alternatively they could represent haploid individuals.

The location column of the individual table is intended for geographical location, if relevant. The parents column, if filled out, allows us to construct parts of the pedigree (the genealogical family tree) connecting the individuals in question. It contains a list of individual IDs (or tskit.NULL/-1 if a parent is unknown). In this case, the parents column indicates that Alice and Bob share the same parent, individual 2 (i.e. Carol). That means Alice and Bob must be either siblings or half siblings.

Metadata#

As can be seen above, each table row, as well as the entire table collection, can be associated with (optional) metadata, which can be stored in various formats. The format is determined by the schema of the metadata. In this example, for clarity, the metadata of the population and individual tables is stored as JSON.

Because we have chosen to use JSON for the metadata of the individuals table, it is easy to access the metadata values:

print("Information from the tree sequence")
for i in ts.individuals():
    print(f"Individual {i.id} is named {i.metadata['name']}")
print("\nEquivalent information from the associated tables")
for row in ts.tables.individuals:
    print(f"Individual name: {row.metadata['name']}")
Information from the tree sequence
Individual 0 is named Alice
Individual 1 is named Bob
Individual 2 is named Carol

Equivalent information from the associated tables
Individual name: Alice
Individual name: Bob
Individual name: Carol

Later in this tutorial there are some basic examples of editing the metadata fields; there is also an extensive Working with Metadata tutorial which provides much more detail about how to construct and use tskit metadata.

Accessing table data#

A simple way to see the contents of a table is to print it:

print(ts.tables.nodes)
╔══╤═════╤══════════╤══════════╤══════════╤════════╗
║id│flags│population│individual│time      │metadata║
╠══╪═════╪══════════╪══════════╪══════════╪════════╣
║0 │    1│         0│         0│0.00000000│        ║
║1 │    1│         0│         0│0.00000000│        ║
║2 │    1│         0│         1│0.00000000│        ║
║3 │    0│         0│         2│0.15000000│        ║
║4 │    0│        -1│        -1│0.60000000│        ║
║5 │    0│        -1│        -1│0.80000000│        ║
║6 │    0│        -1│        -1│1.00000000│        ║
╚══╧═════╧══════════╧══════════╧══════════╧════════╝

But tskit also provides access to columns and rows separately.

Column access#

An entire column in a table can be extracted as a numpy array from the table object. For instance, if n is a NodeTable, then n.time will return an array containing the birth times of the individuals whose genomes are represented by the nodes in the table.

ts.tables.nodes.time
array([0.  , 0.  , 0.  , 0.15, 0.6 , 0.8 , 1.  ])

You can also modify entire columns, but when doing this to tables from a tree sequence, you should always modify a copy of the original data, using the TreeSequence.dump_tables() method:

new_tables = ts.dump_tables()  # make a copy of the tree sequence tables, for editing
new_tables.nodes.time
array([0.  , 0.  , 0.  , 0.15, 0.6 , 0.8 , 1.  ])

Moreover, it is important to note that this numpy array itself is copy of the data stored in new_tables. In other words, modifying individual elements of n.time will not change the node table n. To change the column data, you need to take a copy of the array, modify it, and assign it back in. For example, here we add 0.25 to every time except the first in the node table:

node_table = new_tables.nodes
times = node_table.time
print("Old node times:", times)
times[1:] = times[1:] + 0.25
node_table.time = times
node_table
Old node times: [0.   0.   0.   0.15 0.6  0.8  1.  ]
idflagspopulationindividualtimemetadata
01000.00000000
11000.25000000
21010.25000000
30020.40000000
40-1-10.85000000
50-1-11.05000000
60-1-11.25000000

When assigning columns like this, an error will be raised if the column is not of the expected length:

node_table.time = times[2:]
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[18], line 1
----> 1 node_table.time = times[2:]

File /opt/hostedtoolcache/Python/3.11.10/x64/lib/python3.11/site-packages/tskit/tables.py:488, in BaseTable.__setattr__(self, name, value)
    486     d = self.asdict()
    487     d[name] = value
--> 488     self.set_columns(**d)
    489 else:
    490     object.__setattr__(self, name, value)

File /opt/hostedtoolcache/Python/3.11.10/x64/lib/python3.11/site-packages/tskit/tables.py:1210, in NodeTable.set_columns(self, flags, time, population, individual, metadata, metadata_offset, metadata_schema)
   1177 """
   1178 Sets the values for each column in this :class:`NodeTable` using the values in
   1179 the specified arrays. Overwrites existing data in all the table columns.
   (...)
   1207 :type metadata_schema: str
   1208 """
   1209 self._check_required_args(flags=flags, time=time)
-> 1210 self.ll_table.set_columns(
   1211     dict(
   1212         flags=flags,
   1213         time=time,
   1214         population=population,
   1215         individual=individual,
   1216         metadata=metadata,
   1217         metadata_offset=metadata_offset,
   1218         metadata_schema=metadata_schema,
   1219     )
   1220 )

ValueError: Input array dimensions must be equal.

Row access#

Rows can be accessed using square braces, which will return an object containing the raw values, accessible via attributes

row = ts.tables.nodes[3]
print(
    "Node 3, belonging to individual", row.individual,
    "in population", row.population,
    "exists at time", row.time,
)
print(row)
Node 3, belonging to individual 2 in population 0 exists at time 0.15
NodeTableRow(flags=0, time=0.15, population=0, individual=2, metadata=b'')

Additionally, many rows can be extracted into a new table using slices or numpy indexing

ts.tables.nodes[2:4]
idflagspopulationindividualtimemetadata
01010.00000000
10020.15000000

As with columns, to change the data in a row of a tree sequence table, you must operate on the copy that was produced by TreeSequence.dump_tables(). Here we’ll try to change the ancestral state value of site 0.

new_tables.sites[0]  # Check the site's ancestral state ("A")
SiteTableRow(position=15.0, ancestral_state='A', metadata=b'')

Since the row object(s) returned by the calls above are themselves copies, the attributes enforce read-only access:

new_tables.sites[0].ancestral_state = "C" # This will not work: attributes are read-only
---------------------------------------------------------------------------
FrozenInstanceError                       Traceback (most recent call last)
Cell In[22], line 1
----> 1 new_tables.sites[0].ancestral_state = "C" # This will not work: attributes are read-only

File <string>:4, in __setattr__(self, name, value)

FrozenInstanceError: cannot assign to field 'ancestral_state'

So to change row data you need to replace the row completely:

new_tables.sites[0] = tskit.SiteTableRow(position=15, ancestral_state="C", metadata=None)

For convenience, each row object has a replace method which creates a new table row from an old one. This is often an easier way to changing one or two attributes:

# Another way to change a single entry: create a new row to replace the old one ...
replacement_row = new_tables.sites[0].replace(ancestral_state="C")
# ... then overwrite the original
new_tables.sites[0] = replacement_row

# Or equivalently, do both in a single step
new_tables.sites[0] = new_tables.sites[0].replace(ancestral_state="C")

It is also easy to add more rows:

new_pos = 10
new_site_id = new_tables.sites.add_row(
    position=new_pos, ancestral_state="G", metadata=b"An empty site"
    # NB: For this example table we have to feed a raw byte string as metadata
)
print(f"New empty site allocated at position {new_pos} with ID {new_site_id}")
new_tables.sites
New empty site allocated at position 10 with ID 3
idpositionancestral_statemetadata
015C
142G
260T
310Gb'An empty site'

Turning tables into a tree sequence#

The TableCollection.tree_sequence() method will attempt to turn a table collection into a tree sequence. This is not guaranteed to work: it’s possible you have created a nonsensical tree sequence where, for example, a child node has multiple parents at a given position, or where edges reference non-existant nodes. The Valid tree sequence requirements section of the official tskit docs lists the requirements for a TableCollection to represent a valid TreeSequence.

Sorting#

Even if the tables represent a valid tree sequence, for efficiency reasons tree sequences also require several of their constituent tables to be sorted in a specific order. For instance, edges are required to be sorted in nondecreasing order of parent time, and sites are required to be sorted in order of position. We can ensure that a set of tables are correctly sorted by calling the TableCollection.sort() method.

In fact the new empty site that we added to the new_tables collection in the previous section was at the start of the genome, so the table collection no longer has its sites in the order required for a tree sequence. That means we need to call new_tables.sort() before turning the table collection into a tree sequence:

# altered_ts = new_tables.tree_sequence()  # This won't work
new_tables.sort()
altered_ts = new_tables.tree_sequence()  # Now it will

Now that it had been turned into a tree sequence, we can plot it:

# Plot without mutation labels, for clarity 
altered_ts.draw_svg(y_axis=True, y_gridlines=True, mutation_labels={})
_images/f6a24404a632abc7ee8d9afc2ad755da0ea030f83cfacd9a8deac311be87762f.svg

You can see that the new tree sequence has been modified as expected: there is a new empty site at position 10 (represented by a tickmark above the X axis with no mutations on it), and all the nodes except node 0 have had their times increased by 0.25.

Constructing a tree sequence#

With the tools above in mind, we can now see how to construct a tree sequence by hand. It’s unlikely that you would ever need to do this from scratch, but it’s helpful to understand how tables work. We’ll build an extremely simple tree sequence, consisting of a single tree that looks like this:

tskit.load("data/construction_example.trees").draw_svg(y_axis=True)
_images/b44df646e6bd47553ccdc3157040ac803ecf0cc2745e7da206f6c02f41b864aa.svg

Starting with an empty set of tables, we can fill, say, the node information by using NodeTable.add_row() as follows:

tables = tskit.TableCollection(sequence_length=1e3)
node_table = tables.nodes  # set up an alias, for efficiency
node_table.add_row(flags=tskit.NODE_IS_SAMPLE)  # Node 0: defaults to time 0
node_table.add_row(flags=tskit.NODE_IS_SAMPLE)  # Node 1: defaults to time 0
node_table.add_row(time=10)  # Node 2: not a sample
node_table.add_row(flags=tskit.NODE_IS_SAMPLE)  # Node 3
node_table.add_row(time=20)  # Node 4
node_table
idflagspopulationindividualtimemetadata
01-1-10
11-1-10
20-1-110
31-1-10
40-1-120

The .add_row() method is natural (and should be reasonably efficient) if new records appear one-by-one. Alternatively .set_columns() can be used to set columns for all the rows at once, by passing in numpy arrays of the appropriate type. We’ll use that for the edges:

import numpy as np
edge_table = tables.edges
edge_table.set_columns(
    left=np.array([0.0, 0.0, 0.0, 0.0]),
    right=np.array([1e3, 1e3, 1e3, 1e3]),
    parent=np.array([2, 2, 4, 4], dtype=np.int32),  # References IDs in the node table
    child=np.array([0, 1, 2, 3], dtype=np.int32),  # References IDs in the node table
)
edge_table
idleftrightparentchildmetadata
00100020
10100021
20100042
30100043

Finally we can add a site and a mutation: here we’ll use 0/1 mutations rather than ATGC.

site_id = tables.sites.add_row(position=500.0, ancestral_state='0')
tables.mutations.add_row(site=site_id, node=2, derived_state='1')
tables.sort()  # make sure the edges & sites are in the right order
ts = tables.tree_sequence()
print("A hand-built tree sequence!")
ts.draw_svg(y_axis=True)
A hand-built tree sequence!
_images/a8b34da42965bda54ecb1a658b00ed6b326e88fabcbe67e19e6987d5e247efd4.svg

Note

The ordering requirements for a valid tree sequence do not specify that rows in the node table have to be sorted in any particular order, e.g. by time. By convention, sample nodes are often the first ones listed in the node table, and this is the node order returned by simplify(), but the example above shows that sample nodes need not necessarily be those with the IDs \(0..n\).

Editing tree sequences#

If you may wish to make modifications to a tree sequence, you will immediately encounter a barrier: tree sequence objects are immutable and so cannot be edited in-place. Therefore, the way to change a tree sequence is to edit its underlying tables.

Before providing further examples of how to edit tables, it’s worth noting that tskit may provide a built-in method to achieve what you want to do, without you having to think about the underlying tables at all. In particular, several methods will return a new tree sequence that has been modified in some way. Here are some key ones:

For example, to strip singleton sites from a tree sequence, we can simply call the TreeSequence.delete_sites() method:

import msprime
import numpy as np

ts = msprime.sim_ancestry(10, random_seed=123)
ts = msprime.sim_mutations(ts, rate=10, discrete_genome=False, random_seed=123)
print(ts.num_sites, "sites in the simulated tree sequence")

# Identify sites with a singleton allele
sites_with_a_singleton_allele = []
for v in ts.variants():
    non_missing_genotypes = v.genotypes[v.genotypes != tskit.MISSING_DATA]
    if np.any(np.bincount(non_missing_genotypes) == 1):
        sites_with_a_singleton_allele.append(v.site.id)
        
# Strip those sites from the tree sequence
ts_new = ts.delete_sites(sites_with_a_singleton_allele)
print(ts_new.num_sites, "sites after removing singletons")
109 sites in the simulated tree sequence
52 sites after removing singletons

However, you want to do something not covered by a built-in method, you will need to edit a copy of the underlying tables and then create a new tree sequence from the modified tables.

Minor edits#

Previously we saw how to change specific rows by overwriting them. Here, for example, we use IndividualTableRow.replace() to change the metadata associated with an individual in one of our example tree sequences:

tables = altered_ts.dump_tables()
# Change the name "Bob" to "Robert"
tables.individuals[1] = tables.individuals[1].replace(metadata={"name": "Robert"})
edited_ts = tables.tree_sequence()
edited_ts.tables.individuals
idflagslocationparentsmetadata
002, -1{'name': 'Alice'}
102, -1{'name': 'Robert'}
20-1, -1{'name': 'Carol'}

If you are overwriting an entire column, especially something like metadata which is a ragged column, looping through each row and changing it as above will not be very performant. Instead, you should overwrite the entire column (or multiple columns) at once. We previously saw how to overwrite an entire column of numbers, and for most columns that should suffice. However, overwriting ragged columns like metadata or parents is more involved, as you cannot set them to a single numpy array. Instead, for these columns, tskit provides a number of packset_ routines, which set the appropriate column from a Python list of values.

Here’s an example of changing an entire column of metadata in one pass (in this case, note the extra step of validating and encoding before calling IndividualTable.packset_metadata(), as described in the metadata tutorial).

# Add a surname to all individuals
new_metadata = [{"name": i.metadata["name"] + " Smith"} for i in tables.individuals]
validated_metadata = [
    tables.individuals.metadata_schema.validate_and_encode_row(row) for row in new_metadata
]
tables.individuals.packset_metadata(validated_metadata)
edited_ts = tables.tree_sequence()
edited_ts.tables.individuals
idflagslocationparentsmetadata
002, -1{'name': 'Alice Smith'}
102, -1{'name': 'Robert Smith'}
20-1, -1{'name': 'Carol Smith'}

Major edits#

For larger changes, it is often worth clearing the entire table and filling it. Here, for example, we clear all mutations from the mutations table, then fill it up again, if possible using a more parsimonious placement of mutations (i.e. a placement that would better explain the observed variation among samples - see the Parsimony tutorial section). This also involves changing the ancestral state at the site if necessary.

# Make a tree sequence with loads of recurrent/back mutations, for testing only
ts = msprime.sim_ancestry(10, sequence_length=100, recombination_rate=0.1, random_seed=123)
ts = msprime.sim_mutations(ts, rate=0.1, random_seed=456)
print("Original ts has", ts.num_mutations, "mutations on", ts.num_trees, "trees")

tables = ts.dump_tables()  # Copy the table collection associated with this tree sequence
tables.mutations.clear()  # Clear all mutations in the table collection copy
variant = tskit.Variant(ts)  # Reuse the same Variant object
for tree in ts.trees():
    for site in tree.sites():
        variant.decode(site.id)  # Efficient if ids are sequential
        anc_state, mutations = tree.map_mutations(variant.genotypes, variant.alleles)
        if len(mutations) <  len(site.mutations):
            # Genotypes can be explained with a more parsimonious distribution of mutations
            if anc_state != site.ancestral_state:
                tables.sites[site.id] = tables.sites[site.id].replace(ancestral_state=anc_state)
            for mut in mutations:
                tables.mutations.append(mut.replace(site=site.id))            
        else:
            # Add the original mutations back in (note that parent ids may have changed,
            # so we will need to call compute_mutation_parents at the end)
            for mut in site.mutations:
                tables.mutations.append(mut)
tables.compute_mutation_parents()
tables.compute_mutation_times()
tables.sort()
parsimonious_ts = tables.tree_sequence()

print("Parsimonious ts has", ts.num_mutations - parsimonious_ts.num_mutations, "fewer muts")

# Optional checks
assert ts.num_trees == parsimonious_ts.num_trees
assert ts.num_sites == parsimonious_ts.num_sites
for v1, v2 in zip(ts.variants(), parsimonious_ts.variants()):
    # convert alleles to numpy arrays so we can index into them easily
    alleles1, alleles2 = np.array(v1.alleles), np.array(v2.alleles)
    # Verify the distribution of variation is the same, by comparing the allelic state
    assert np.all(alleles1[v1.genotypes] == alleles2[v2.genotypes])
Original ts has 154 mutations on 71 trees
Parsimonious ts has 33 fewer muts

Note

In the code above we iterate over the trees in the original tree sequence, but make changes to a copy of the tables associated with those trees, modifying the table collection as we go. This is a common idiom.