Tables and editing

Tables and editing#

Internally, a tree sequence can be thought of as a set of tables, and tskit provides an interface for dealing with these tables directly. This is particularly relevant when you wish to edit or otherwise modify a tree sequence, although tables are also useful for bulk access to data contained in the tree sequence, such as the times of all nodes.

There are eight tables that together define a tree sequence, although some may be empty, and together they form a TableCollection. The tables are defined in the official tskit documentation for Table Definitions, and the Tables API section in the docs describes how to work with them. In this tutorial we give some pointers about what you can and cannot do with them.

Correspondence between tables and trees#

Consider the following sequence of trees:

ts = tskit.load("data/tables_example.trees")

_images/0327585c23c21d289094cc6394cc71ecc7e43f14197d7961b9d759c5abcc0e29.svg

Ancestral recombination events have produced three different trees that relate the three sampled genomes 0, 1, and 2 to each other along the chromosome of length 100.

Node and edge tables#

Each node in each of the above trees represents a particular ancestral genome (a haploid genome; diploid individuals would be represented by two nodes). Details about each node, including the time it lived, are stored in a NodeTable (details here)

ts.tables.nodes

id	flags	population	individual	time
0	1	0	0	0.00000000
1	1	0	0	0.00000000
2	1	0	1	0.00000000
3	0	0	2	0.15000000
4	0	-1	-1	0.60000000
5	0	-1	-1	0.80000000
6	0	-1	-1	1.00000000

Importantly, the first column, id, is not actually recorded, and is only shown when printing out node tables (as here) for convenience. The second column, flags records a 1 for the nodes that are samples, i.e., whose entire genealogical history is recorded by these trees. You can see that the sample nodes are present in all the trees, but other nodes need not be: for example node 3 is only seen in the middle tree. The individual and population columns store an index into the rows of the Individual and population tables, or tskit.NULL (-1) if the node is not associated with an individual or population. The time column records the historical age of the node, with contemporary nodes having a time of 0.

The way the nodes are related to each other (i.e. the tree topology) is stored in an EdgeTable (details here). Since some edges are present in more than one tree (e.g., node 1 inherits from node 4 across the entire sequence), each edge contains not only the IDs of a parent and a child node, but also the left and right positions which define the genomic region for which it appears in the trees:

ts.tables.edges

id	left	right	parent	child
0	20	80	3	0
1	20	80	3	2
2	0	100	4	1
3	0	20	4	2
4	80	100	4	2
5	20	80	4	3
6	80	100	5	0
7	80	100	5	4
8	0	20	6	0
9	0	20	6	4

Since node 3 is most recent, the edge that says that nodes 0 and 2 inherit from node 3 on the interval between 0.2 and 0.8 comes first. Next are the edges from node 4: there are four of these, as the edge from node 4 to node 1 is shared across the entire sequence, and for each of the three genomic intervals there is an additional child node. At this point, we know the full tree on the middle interval. Finally, edges specifying the common ancestor of 0 and 4 on the remaining intervals (parents 6 and 5 respectively) allow us to construct all trees across the entire interval.

Site and mutation tables#

Most tree sequences have DNA variation data associated with them, stored as mutations overlaid on the trees:

ts = tskit.load("data/tables_example_muts.trees")

_images/4a9eee4906ad34d1d4026f2fdc0fc571674bee2f5912f29153c28221be7ed3a3.svg

There are four mutations in the depiction above, marked by red crosses: one above node 4 on the first tree which records an A to G transition at position 15, another above node 1 in the second tree which records a G to A transition at position 42, and the final two above nodes 0 and 3 recording transitions, both at position 60, on the second tree. The positions are recorded in the SiteTable (details here):

ts.tables.sites

id	position	ancestral_state
0	15	A
1	42	G
2	60	T

As with node tables, the id column is not actually recorded, but is implied by the position in the table. The mutations themselves are recorded in the MutationTable (details here). This associates each mutation with a site ID, the ID of the node above which the mutation occurs, the derived state to which the allele has mutated, and (optionally) a time at which the mutation occured.

ts.tables.mutations

id	site	node	time	derived_state	parent
0	0	4	0.90000000	G	-1
1	1	1	0.40000000	A	-1
2	2	3	0.55000000	C	-1
3	2	0	0.10000000	T	2

Where there are multiple mutations at the same site, there can be mutations stacked on top of each other. The “parent” column therefore contains the ID of the mutation immediately above the current one at the same site, or -1 if there is no parent.

These sites and mutations allow us to calculate the DNA sequence, or haplotype, for each of the sample nodes:

for sample, h in enumerate(ts.haplotypes()):
    print(f"Sample {sample}: {h}")

Sample 0: AGT
Sample 1: GAT
Sample 2: GGC

Tables overview#

Now that we’ve introduced the most important tables that make up a tree sequence, it’s worth seeing how they come together to form a TableCollection.

The following visualization shows how they relate to one another, as well as introducing two additional tables: the IndividualTable and PopulationTable (which record which individuals and populations a node is contained in).

The tables in a tree sequence, and their interdependencies

The lines in this diagram below show where one table refers to the IDs of another table, that is, they indicate dependencies between tables.

Individual and population tables#

Previously we noted that a row in the node table could contain an index into the IndividualTable and the PopulationTable. In fact, in our example tree sequence, the node table above indicates that nodes 0 to 3 belong to the population with ID 0 in the population table. Here’s what that table looks like in this tree sequence:

ts.tables.populations

id	metadata
0	{'name': 'GBR', 'description': 'British in England and Scotland'}

The data in this table, and indeed in the individual table, is not visible in the default tree sequence visualization, but can be very useful for keeping track of information about the identity and location of nodes in the tree sequence. In particular, you can see that the populations in this tree sequence have associated metadata, a topic we will discuss in the next section.

The node table also indicated that two nodes (node 0 and 1) point to the individual with ID 0, one node (node 2) points to individual 1, and another node points to individual 2. Here’s the relevant individual table:

ts.tables.individuals

id	parents	metadata
0	2, -1	{'name': 'Alice'}
1	2, -1	{'name': 'Bob'}
2	-1, -1	{'name': 'Carol'}

Since individual 0 (Alice) is referenced by two nodes (genomes), and assuming that this is a diploid species, this indicates that we have sampled both maternal and paternal genomes for this individual. The other two indivividuals, Bob and Carol, are referenced by only one node, so either we only managed to sample one of their two genomes, or alternatively they could represent haploid individuals.

The location column of the individual table is intended for geographical location, if relevant. The parents column, if filled out, allows us to construct parts of the pedigree (the genealogical family tree) connecting the individuals in question. It contains a list of individual IDs (or tskit.NULL/-1 if a parent is unknown). In this case, the parents column indicates that Alice and Bob share the same parent, individual 2 (i.e. Carol). That means Alice and Bob must be either siblings or half siblings.

Metadata#

As can be seen above, each table row, as well as the entire table collection, can be associated with (optional) metadata, which can be stored in various formats. The format is determined by the schema of the metadata. In this example, for clarity, the metadata of the population and individual tables is stored as JSON.

Because we have chosen to use JSON for the metadata of the individuals table, it is easy to access the metadata values:

print("Information from the tree sequence")
for i in ts.individuals():
    print(f"Individual {i.id} is named {i.metadata['name']}")
print("\nEquivalent information from the associated tables")
for row in ts.tables.individuals:
    print(f"Individual name: {row.metadata['name']}")

Information from the tree sequence
Individual 0 is named Alice
Individual 1 is named Bob
Individual 2 is named Carol

Equivalent information from the associated tables
Individual name: Alice
Individual name: Bob
Individual name: Carol

Later in this tutorial there are some basic examples of editing the metadata fields; there is also an extensive Working with Metadata tutorial which provides much more detail about how to construct and use tskit metadata.

Accessing table data#

A simple way to see the contents of a table is to print it:

print(ts.tables.nodes)

╔══╤═════╤══════════╤══════════╤══════════╤════════╗
║id│flags│population│individual│time      │metadata║
╠══╪═════╪══════════╪══════════╪══════════╪════════╣
║0 │    1│         0│         0│0.00000000│        ║
║1 │    1│         0│         0│0.00000000│        ║
║2 │    1│         0│         1│0.00000000│        ║
║3 │    0│         0│         2│0.15000000│        ║
║4 │    0│        -1│        -1│0.60000000│        ║
║5 │    0│        -1│        -1│0.80000000│        ║
║6 │    0│        -1│        -1│1.00000000│        ║
╚══╧═════╧══════════╧══════════╧══════════╧════════╝

But tskit also provides access to columns and rows separately.

Column access#

An entire column in a table can be extracted as a numpy array from the table object. For instance, if n is a NodeTable, then n.time will return an array containing the birth times of the individuals whose genomes are represented by the nodes in the table.

ts.tables.nodes.time

array([0.  , 0.  , 0.  , 0.15, 0.6 , 0.8 , 1.  ])

You can also modify entire columns, but when doing this to tables from a tree sequence, you should always modify a copy of the original data, using the TreeSequence.dump_tables() method:

new_tables = ts.dump_tables()  # make a copy of the tree sequence tables, for editing
new_tables.nodes.time

array([0.  , 0.  , 0.  , 0.15, 0.6 , 0.8 , 1.  ])

Moreover, it is important to note that this numpy array itself is copy of the data stored in new_tables. In other words, modifying individual elements of n.time will not change the node table n. To change the column data, you need to take a copy of the array, modify it, and assign it back in. For example, here we add 0.25 to every time except the first in the node table:

node_table = new_tables.nodes
times = node_table.time
print("Old node times:", times)
times[1:] = times[1:] + 0.25
node_table.time = times
node_table

Old node times: [0.   0.   0.   0.15 0.6  0.8  1.  ]

id	flags	population	individual	time
0	1	0	0	0.00000000
1	1	0	0	0.25000000
2	1	0	1	0.25000000
3	0	0	2	0.40000000
4	0	-1	-1	0.85000000
5	0	-1	-1	1.05000000
6	0	-1	-1	1.25000000

When assigning columns like this, an error will be raised if the column is not of the expected length:

node_table.time = times[2:]

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[18], line 1
----> 1 node_table.time = times[2:]

File /opt/hostedtoolcache/Python/3.11.12/x64/lib/python3.11/site-packages/tskit/tables.py:488, in BaseTable.__setattr__(self, name, value)
    486     d = self.asdict()
    487     d[name] = value
--> 488     self.set_columns(**d)
    489 else:
    490     object.__setattr__(self, name, value)

File /opt/hostedtoolcache/Python/3.11.12/x64/lib/python3.11/site-packages/tskit/tables.py:1210, in NodeTable.set_columns(self, flags, time, population, individual, metadata, metadata_offset, metadata_schema)
   1177 """
   1178 Sets the values for each column in this :class:`NodeTable` using the values in
   1179 the specified arrays. Overwrites existing data in all the table columns.
   (...)   1207 :type metadata_schema: str
   1208 """
   1209 self._check_required_args(flags=flags, time=time)
-> 1210 self.ll_table.set_columns(
   1211     dict(
   1212         flags=flags,
   1213         time=time,
   1214         population=population,
   1215         individual=individual,
   1216         metadata=metadata,
   1217         metadata_offset=metadata_offset,
   1218         metadata_schema=metadata_schema,
   1219     )
   1220 )

ValueError: Input array dimensions must be equal.

Row access#

Rows can be accessed using square braces, which will return an object containing the raw values, accessible via attributes

row = ts.tables.nodes[3]
print(
    "Node 3, belonging to individual", row.individual,
    "in population", row.population,
    "exists at time", row.time,
)
print(row)

Node 3, belonging to individual 2 in population 0 exists at time 0.15
NodeTableRow(flags=0, time=0.15, population=0, individual=2, metadata=b'')

Additionally, many rows can be extracted into a new table using slices or numpy indexing

ts.tables.nodes[2:4]

id	flags	population	individual	time	metadata
0	1	0	1	0.00000000
1	0	0	2	0.15000000

As with columns, to change the data in a row of a tree sequence table, you must operate on the copy that was produced by TreeSequence.dump_tables(). Here we’ll try to change the ancestral state value of site 0.

new_tables.sites[0]  # Check the site's ancestral state ("A")

SiteTableRow(position=15.0, ancestral_state='A', metadata=b'')

Since the row object(s) returned by the calls above are themselves copies, the attributes enforce read-only access:

new_tables.sites[0].ancestral_state = "C" # This will not work: attributes are read-only

---------------------------------------------------------------------------
FrozenInstanceError                       Traceback (most recent call last)
Cell In[22], line 1
----> 1 new_tables.sites[0].ancestral_state = "C" # This will not work: attributes are read-only

File <string>:4, in __setattr__(self, name, value)

FrozenInstanceError: cannot assign to field 'ancestral_state'

So to change row data you need to replace the row completely:

new_tables.sites[0] = tskit.SiteTableRow(position=15, ancestral_state="C", metadata=None)

For convenience, each row object has a replace method which creates a new table row from an old one. This is often an easier way to changing one or two attributes:

# Another way to change a single entry: create a new row to replace the old one ...
replacement_row = new_tables.sites[0].replace(ancestral_state="C")
# ... then overwrite the original
new_tables.sites[0] = replacement_row

# Or equivalently, do both in a single step
new_tables.sites[0] = new_tables.sites[0].replace(ancestral_state="C")

It is also easy to add more rows:

new_pos = 10
new_site_id = new_tables.sites.add_row(
    position=new_pos, ancestral_state="G", metadata=b"An empty site"
    # NB: For this example table we have to feed a raw byte string as metadata
)
print(f"New empty site allocated at position {new_pos} with ID {new_site_id}")
new_tables.sites

New empty site allocated at position 10 with ID 3

id	position	ancestral_state	metadata
0	15	C
1	42	G
2	60	T
3	10	G	b'An empty site'

Turning tables into a tree sequence#

The TableCollection.tree_sequence() method will attempt to turn a table collection into a tree sequence. This is not guaranteed to work: it’s possible you have created a nonsensical tree sequence where, for example, a child node has multiple parents at a given position, or where edges reference non-existant nodes. The Valid tree sequence requirements section of the official tskit docs lists the requirements for a TableCollection to represent a valid TreeSequence.

Sorting#

Even if the tables represent a valid tree sequence, for efficiency reasons tree sequences also require several of their constituent tables to be sorted in a specific order. For instance, edges are required to be sorted in nondecreasing order of parent time, and sites are required to be sorted in order of position. We can ensure that a set of tables are correctly sorted by calling the TableCollection.sort() method.

In fact the new empty site that we added to the new_tables collection in the previous section was at the start of the genome, so the table collection no longer has its sites in the order required for a tree sequence. That means we need to call new_tables.sort() before turning the table collection into a tree sequence:

# altered_ts = new_tables.tree_sequence()  # This won't work
new_tables.sort()
altered_ts = new_tables.tree_sequence()  # Now it will

Now that it had been turned into a tree sequence, we can plot it:

# Plot without mutation labels, for clarity 
altered_ts.draw_svg(y_axis=True, y_gridlines=True, mutation_labels={})

_images/f6a24404a632abc7ee8d9afc2ad755da0ea030f83cfacd9a8deac311be87762f.svg

You can see that the new tree sequence has been modified as expected: there is a new empty site at position 10 (represented by a tickmark above the X axis with no mutations on it), and all the nodes except node 0 have had their times increased by 0.25.

Constructing a tree sequence#

With the tools above in mind, we can now see how to construct a tree sequence by hand. It’s unlikely that you would ever need to do this from scratch, but it’s helpful to understand how tables work. We’ll build an extremely simple tree sequence, consisting of a single tree that looks like this:

tskit.load("data/construction_example.trees").draw_svg(y_axis=True)

_images/b44df646e6bd47553ccdc3157040ac803ecf0cc2745e7da206f6c02f41b864aa.svg

Starting with an empty set of tables, we can fill, say, the node information by using NodeTable.add_row() as follows:

tables = tskit.TableCollection(sequence_length=1e3)
node_table = tables.nodes  # set up an alias, for efficiency
node_table.add_row(flags=tskit.NODE_IS_SAMPLE)  # Node 0: defaults to time 0
node_table.add_row(flags=tskit.NODE_IS_SAMPLE)  # Node 1: defaults to time 0
node_table.add_row(time=10)  # Node 2: not a sample
node_table.add_row(flags=tskit.NODE_IS_SAMPLE)  # Node 3
node_table.add_row(time=20)  # Node 4
node_table

id	flags	population	individual	time
0	1	-1	-1	0
1	1	-1	-1	0
2	0	-1	-1	10
3	1	-1	-1	0
4	0	-1	-1	20

The .add_row() method is natural (and should be reasonably efficient) if new records appear one-by-one. Alternatively .set_columns() can be used to set columns for all the rows at once, by passing in numpy arrays of the appropriate type. We’ll use that for the edges:

import numpy as np
edge_table = tables.edges
edge_table.set_columns(
    left=np.array([0.0, 0.0, 0.0, 0.0]),
    right=np.array([1e3, 1e3, 1e3, 1e3]),
    parent=np.array([2, 2, 4, 4], dtype=np.int32),  # References IDs in the node table
    child=np.array([0, 1, 2, 3], dtype=np.int32),  # References IDs in the node table
)
edge_table

id	right	parent	child
0	1000	2	0
1	1000	2	1
2	1000	4	2
3	1000	4	3

Finally we can add a site and a mutation: here we’ll use 0/1 mutations rather than ATGC.

site_id = tables.sites.add_row(position=500.0, ancestral_state='0')
tables.mutations.add_row(site=site_id, node=2, derived_state='1')
tables.sort()  # make sure the edges & sites are in the right order
ts = tables.tree_sequence()
print("A hand-built tree sequence!")
ts.draw_svg(y_axis=True)

A hand-built tree sequence!

_images/a8b34da42965bda54ecb1a658b00ed6b326e88fabcbe67e19e6987d5e247efd4.svg

Note

The ordering requirements for a valid tree sequence do not specify that rows in the node table have to be sorted in any particular order, e.g. by time. By convention, sample nodes are often the first ones listed in the node table, and this is the node order returned by simplify(), but the example above shows that sample nodes need not necessarily be those with the IDs \(0..n\).

Editing tree sequences#

If you may wish to make modifications to a tree sequence, you will immediately encounter a barrier: tree sequence objects are immutable and so cannot be edited in-place. Therefore, the way to change a tree sequence is to edit its underlying tables.

Before providing further examples of how to edit tables, it’s worth noting that tskit may provide a built-in method to achieve what you want to do, without you having to think about the underlying tables at all. In particular, several methods will return a new tree sequence that has been modified in some way. Here are some key ones:

TreeSequence.delete_sites() returns a tree sequence with certain sites deleted
TreeSequence.delete_intervals() and TreeSequence.keep_intervals() return a tree sequence whose trees cover a smaller fraction of the genome (and which can be combined with TreeSequence.trim() to change the coordinate system)
TreeSequence.simplify() returns a new tree sequence with some sample nodes removed (see the simplification tutorial).
TreeSequence.union() returns a new tree sequence formed by merging two other tree sequences together.

For example, to strip singleton sites from a tree sequence, we can simply call the TreeSequence.delete_sites() method:

import msprime
import numpy as np

ts = msprime.sim_ancestry(10, random_seed=123)
ts = msprime.sim_mutations(ts, rate=10, discrete_genome=False, random_seed=123)
print(ts.num_sites, "sites in the simulated tree sequence")

# Identify sites with a singleton allele
sites_with_a_singleton_allele = []
for v in ts.variants():
    non_missing_genotypes = v.genotypes[v.genotypes != tskit.MISSING_DATA]
    if np.any(np.bincount(non_missing_genotypes) == 1):
        sites_with_a_singleton_allele.append(v.site.id)
        
# Strip those sites from the tree sequence
ts_new = ts.delete_sites(sites_with_a_singleton_allele)
print(ts_new.num_sites, "sites after removing singletons")

109 sites in the simulated tree sequence
52 sites after removing singletons

However, you want to do something not covered by a built-in method, you will need to edit a copy of the underlying tables and then create a new tree sequence from the modified tables.

Minor edits#

Previously we saw how to change specific rows by overwriting them. Here, for example, we use IndividualTableRow.replace() to change the metadata associated with an individual in one of our example tree sequences:

tables = altered_ts.dump_tables()
# Change the name "Bob" to "Robert"
tables.individuals[1] = tables.individuals[1].replace(metadata={"name": "Robert"})
edited_ts = tables.tree_sequence()
edited_ts.tables.individuals

id	parents	metadata
0	2, -1	{'name': 'Alice'}
1	2, -1	{'name': 'Robert'}
2	-1, -1	{'name': 'Carol'}

If you are overwriting an entire column, especially something like metadata which is a ragged column, looping through each row and changing it as above will not be very performant. Instead, you should overwrite the entire column (or multiple columns) at once. We previously saw how to overwrite an entire column of numbers, and for most columns that should suffice. However, overwriting ragged columns like metadata or parents is more involved, as you cannot set them to a single numpy array. Instead, for these columns, tskit provides a number of packset_ routines, which set the appropriate column from a Python list of values.

Here’s an example of changing an entire column of metadata in one pass (in this case, note the extra step of validating and encoding before calling IndividualTable.packset_metadata(), as described in the metadata tutorial).

# Add a surname to all individuals
new_metadata = [{"name": i.metadata["name"] + " Smith"} for i in tables.individuals]
validated_metadata = [
    tables.individuals.metadata_schema.validate_and_encode_row(row) for row in new_metadata
]
tables.individuals.packset_metadata(validated_metadata)
edited_ts = tables.tree_sequence()
edited_ts.tables.individuals

id	parents	metadata
0	2, -1	{'name': 'Alice Smith'}
1	2, -1	{'name': 'Robert Smith'}
2	-1, -1	{'name': 'Carol Smith'}

Major edits#

For larger changes, it is often worth clearing the entire table and filling it. Here, for example, we clear all mutations from the mutations table, then fill it up again, if possible using a more parsimonious placement of mutations (i.e. a placement that would better explain the observed variation among samples - see the Parsimony tutorial section). This also involves changing the ancestral state at the site if necessary.

# Make a tree sequence with loads of recurrent/back mutations, for testing only
ts = msprime.sim_ancestry(10, sequence_length=100, recombination_rate=0.1, random_seed=123)
ts = msprime.sim_mutations(ts, rate=0.1, random_seed=456)
print("Original ts has", ts.num_mutations, "mutations on", ts.num_trees, "trees")

tables = ts.dump_tables()  # Copy the table collection associated with this tree sequence
tables.mutations.clear()  # Clear all mutations in the table collection copy
variant = tskit.Variant(ts)  # Reuse the same Variant object
for tree in ts.trees():
    for site in tree.sites():
        variant.decode(site.id)  # Efficient if ids are sequential
        anc_state, mutations = tree.map_mutations(variant.genotypes, variant.alleles)
        if len(mutations) <  len(site.mutations):
            # Genotypes can be explained with a more parsimonious distribution of mutations
            if anc_state != site.ancestral_state:
                tables.sites[site.id] = tables.sites[site.id].replace(ancestral_state=anc_state)
            for mut in mutations:
                tables.mutations.append(mut.replace(site=site.id))            
        else:
            # Add the original mutations back in (note that parent ids may have changed,
            # so we will need to call compute_mutation_parents at the end)
            for mut in site.mutations:
                tables.mutations.append(mut)
tables.compute_mutation_parents()
tables.compute_mutation_times()
tables.sort()
parsimonious_ts = tables.tree_sequence()

print("Parsimonious ts has", ts.num_mutations - parsimonious_ts.num_mutations, "fewer muts")

# Optional checks
assert ts.num_trees == parsimonious_ts.num_trees
assert ts.num_sites == parsimonious_ts.num_sites
for v1, v2 in zip(ts.variants(), parsimonious_ts.variants()):
    # convert alleles to numpy arrays so we can index into them easily
    alleles1, alleles2 = np.array(v1.alleles), np.array(v2.alleles)
    # Verify the distribution of variation is the same, by comparing the allelic state
    assert np.all(alleles1[v1.genotypes] == alleles2[v2.genotypes])

Original ts has 154 mutations on 71 trees
Parsimonious ts has 33 fewer muts