Topological analysis#

The branching structure of a tree is known as the tree topology. Dealing with the topological structure of trees is of key importance in dealing with genealogies such as those encoded in the tree sequence structure.

Visiting nodes#

Descendant traversal#

A basic thing to want to do is to visit all the nodes in a tree under a focal node (i.e. the focal node and all its descendants). This can be done in various traversal orders. The tskit library provides several methods to do this, such as Tree.nodes() in the Python API. By default, this method iterates over all the descendant nodes of the root(s) of the tree, and hence visits all the tree’s nodes:

import tskit
from IPython.display import SVG, display

tree = tskit.Tree.generate_balanced(10, arity=3)
display(SVG(tree.draw_svg()))

for order in ["preorder", "inorder", "postorder"]:
    print(f"{order}:\t", list(tree.nodes(order=order)))
_images/14dbe48bdb4ca87ebe46b5031c3561f7f88f430d1541e931ed2a414e92d9bcb3.svg
preorder:	 [14, 10, 0, 1, 2, 11, 3, 4, 5, 13, 6, 7, 12, 8, 9]
inorder:	 [0, 10, 1, 2, 14, 3, 11, 4, 5, 6, 13, 7, 8, 12, 9]
postorder:	 [0, 1, 2, 10, 3, 4, 5, 11, 6, 7, 8, 9, 12, 13, 14]

Providing a focal node ID allows us to traverse through that node’s descendants. For instance, here we visit node 13 and its descendants:

print("Nodes under node 13:", [u for u in tree.nodes(13)])
Nodes under node 13: [13, 6, 7, 12, 8, 9]

The node IDs returned by traversal methods allow us to to access node information. Below, for example, we use the Tree.num_children() method to find the number of children (the “arity”) of every non-leaf node, and take the average. Since the specific ordering of descendant nodes is not important in this case, we can leave it out (defaulting to preorder traversal, the most efficient order):

import numpy as np
av_arity = np.mean([tree.num_children(u) for u in tree.nodes() if not tree.is_leaf(u)])
print(f"Average arity of internal nodes: {av_arity}")
Average arity of internal nodes: 2.8

Note

In tskit, a tree can have multiple Roots, each with a descendant topology. However, for some algorithms, instead of traversing through the descendants of each root of a tree in turn, it can be helpful to start at a single node and traverse downwards through the entire genealogy. The virtual root is provided for this purpose.

Array methods#

The Tree.nodes() iterator provides a convenient way of looping over descendant node IDs, but it can be more efficient to deal with all the IDs at once, as a single array of values. This can be combined with direct memory access resulting in a high performance approach. Here, for example, is an equivalent array-based method to find the average arity of internal nodes, by counting how many times a node is referenced as a parent:

parent_id, count = np.unique(tree.parent_array[tree.preorder()], return_counts=True)
print(f"Average arity is {count[parent_id != tskit.NULL].mean()}")
Average arity is 2.8

See also

The Analysing trees tutorial provides a number of additional examples of tree traversal techniques, with different performance characteristics.

Ascending traversal#

For many applications it is useful to be able to traverse upwards from a node or set of nodes, such as the leaves. We can do this by iterating over parents. Here, for example, we traverse upwards from each of the samples in the tree:

for u in tree.samples():
    path = []
    v = u
    while v != tskit.NULL:
        path.append(v)
        v = tree.parent(v)
    print(u, "->", path)
0 -> [0, 10, 14]
1 -> [1, 10, 14]
2 -> [2, 10, 14]
3 -> [3, 11, 14]
4 -> [4, 11, 14]
5 -> [5, 11, 14]
6 -> [6, 13, 14]
7 -> [7, 13, 14]
8 -> [8, 12, 13, 14]
9 -> [9, 12, 13, 14]

Todo

Indicate that this can be made performant using numba, and link out to a tutorial on high performance methods including the numba approach.

Identifying and counting topologies#

tskit uses a combinatorial approach to identify unique topologies of rooted, leaf-labelled trees. It provides methods for enumerating all possible tree topologies, as well as converting back and forth between a tree and its position, or rank, in the enumeration of all possible topologies. These methods do not only apply to binary trees; rather, they cover general, rooted trees without unary nodes.

Tree.rank()

Return the rank of this tree.

Tree.unrank()

Return a Tree given its rank and a number of leaves.

tskit.all_trees()

Return a generator over all leaf-labelled trees of n leaves.

tskit.all_tree_shapes()

Return a generator over all tree shapes of n leaves.

tskit.all_tree_labellings()

Return a generator over all labellings of the given tree’s shape.

Note

As the number of nodes increases, the number of different topologies rises extremely rapidly (see its entry in the On-Line Encyclopedia of Integer Sequences). This combinatorial explosion is a major limitation in any analysis that attempts to explore possible topologies. For example, although the tskit.all_trees() function above will happily start generating topologies for (say) a tree of 50 leaves, the total number of possible topologies is over \(6^81\), which is of the same order as the number of atoms in the observable universe. Generating all the topologies of a tree with anything much more than 10 tips is likely to be impracticable.

Interpreting Tree Ranks#

To understand tree ranks we must look at how leaf-labelled tree topologies are enumerated. For example, we can use tskit.all_trees() to generate all possible topologies of three leaves:

import tskit
from IPython.display import display, SVG

for t in tskit.all_trees(num_leaves=3):
    display(SVG(t.draw_svg(node_labels={0: 0, 1: 1, 2: 2}, order="tree", size=(120, 120))))
_images/9b2f3b5fa841da7cf0b324300e868d55938a3ccca92d0d6097a8967dcee12f85.svg _images/fa135e5975564fc57eb490eebd9c3379259513c477e019e276e09413a2194637.svg _images/38c8375a73bf949dbca55fe284a7eb8384db862e20755cfe470a8390a55e6cc5.svg _images/17bf6de5249415ec7758a051baf3d12bed2b5e49530edf51689be410dff07750.svg

In this sequence, there exist two distinct tree shapes and each shape can be labelled in at least one unique way. Given that topologies are ordered first by their shape and then by their labelling, a tree topology can be uniquely identified by

  1. The shape of the tree

  2. The labelling of the tree’s shape

We can refer to the first tree in the above enumeration as the first labelling of the first shape of trees with three leaves, or tree \((0, 0)\). The second tree can be identified as the first labelling of the second shape, or \((1, 0)\), and so on. This pair of indexes for the shape and labelling of a tree is referred to as the rank of the tree, and can be computed using the Tree.rank() method.

ranks = [t.rank() for t in tskit.all_trees(num_leaves=3)]
print("Ranks of 3-leaf trees:", ranks)
Ranks of 3-leaf trees: [Rank(shape=0, label=0), Rank(shape=1, label=0), Rank(shape=1, label=1), Rank(shape=1, label=2)]

Note

Ranks in combinatorics are typically natural numbers. However, we refer to this tuple of shape and label rank as a rank because it serves the same purpose of indexing trees in an enumeration.

For details on how shapes and labellings are ordered, see Enumerating Topologies.

We can also reconstruct a leaf-labelled tree given its rank. This process is known as unranking, and can be performed using the Tree.unrank() method.

for rank in [(0, 0), (1, 0), (1, 1), (1, 2)]:
    t = tskit.Tree.unrank(num_leaves=3, rank=rank)
    display(SVG(t.draw_svg(node_labels={0: 0, 1: 1, 2: 2}, order="tree", size=(120, 120))))
_images/9b2f3b5fa841da7cf0b324300e868d55938a3ccca92d0d6097a8967dcee12f85.svg _images/fa135e5975564fc57eb490eebd9c3379259513c477e019e276e09413a2194637.svg _images/38c8375a73bf949dbca55fe284a7eb8384db862e20755cfe470a8390a55e6cc5.svg _images/17bf6de5249415ec7758a051baf3d12bed2b5e49530edf51689be410dff07750.svg

Examples#

One application of tree ranks is to count the different leaf-labelled topologies in a tree sequence. Since the ranks are just tuples, we can use a Python Counter to track them. Here, we count and unrank the most frequently seen topology in a tree sequence. For brevity, this example assumes samples are synonymous with leaves.

import collections
import msprime
# Simulate a tree sequence with 2 diploid individuals (i.e. 4 samples)
ts = msprime.sim_ancestry(2, sequence_length=1e8, recombination_rate=1e-7, random_seed=1)
rank_counts = collections.Counter(t.rank() for t in ts.trees())
most_freq_rank, count = rank_counts.most_common(1)[0]
most_freq_topology = tskit.Tree.unrank(ts.num_samples, most_freq_rank)
print("Most frequent topology")
display(SVG(most_freq_topology.draw_svg(node_labels={0: 0, 1: 1, 2: 2, 3: 3})))
A module that was compiled using NumPy 1.x cannot be run in
NumPy 2.1.2 as it may crash. To support both 1.x and 2.x
versions of NumPy, modules must be compiled with NumPy 2.0.
Some module may need to rebuild instead e.g. with 'pybind11>=2.12'.

If you are a user of the module, the easiest solution will be to
downgrade to 'numpy<2' or try to upgrade the affected module.
We expect that some modules will need time to support NumPy 2.

Traceback (most recent call last):  File "/opt/hostedtoolcache/Python/3.10.15/x64/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/hostedtoolcache/Python/3.10.15/x64/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/opt/hostedtoolcache/Python/3.10.15/x64/lib/python3.10/site-packages/ipykernel_launcher.py", line 18, in <module>
    app.launch_new_instance()
  File "/opt/hostedtoolcache/Python/3.10.15/x64/lib/python3.10/site-packages/traitlets/config/application.py", line 1075, in launch_instance
    app.start()
  File "/opt/hostedtoolcache/Python/3.10.15/x64/lib/python3.10/site-packages/ipykernel/kernelapp.py", line 739, in start
    self.io_loop.start()
  File "/opt/hostedtoolcache/Python/3.10.15/x64/lib/python3.10/site-packages/tornado/platform/asyncio.py", line 205, in start
    self.asyncio_loop.run_forever()
  File "/opt/hostedtoolcache/Python/3.10.15/x64/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
    self._run_once()
  File "/opt/hostedtoolcache/Python/3.10.15/x64/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
    handle._run()
  File "/opt/hostedtoolcache/Python/3.10.15/x64/lib/python3.10/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
  File "/opt/hostedtoolcache/Python/3.10.15/x64/lib/python3.10/site-packages/ipykernel/kernelbase.py", line 545, in dispatch_queue
    await self.process_one()
  File "/opt/hostedtoolcache/Python/3.10.15/x64/lib/python3.10/site-packages/ipykernel/kernelbase.py", line 534, in process_one
    await dispatch(*args)
  File "/opt/hostedtoolcache/Python/3.10.15/x64/lib/python3.10/site-packages/ipykernel/kernelbase.py", line 437, in dispatch_shell
    await result
  File "/opt/hostedtoolcache/Python/3.10.15/x64/lib/python3.10/site-packages/ipykernel/ipkernel.py", line 362, in execute_request
    await super().execute_request(stream, ident, parent)
  File "/opt/hostedtoolcache/Python/3.10.15/x64/lib/python3.10/site-packages/ipykernel/kernelbase.py", line 778, in execute_request
    reply_content = await reply_content
  File "/opt/hostedtoolcache/Python/3.10.15/x64/lib/python3.10/site-packages/ipykernel/ipkernel.py", line 449, in do_execute
    res = shell.run_cell(
  File "/opt/hostedtoolcache/Python/3.10.15/x64/lib/python3.10/site-packages/ipykernel/zmqshell.py", line 549, in run_cell
    return super().run_cell(*args, **kwargs)
  File "/opt/hostedtoolcache/Python/3.10.15/x64/lib/python3.10/site-packages/IPython/core/interactiveshell.py", line 3075, in run_cell
    result = self._run_cell(
  File "/opt/hostedtoolcache/Python/3.10.15/x64/lib/python3.10/site-packages/IPython/core/interactiveshell.py", line 3130, in _run_cell
    result = runner(coro)
  File "/opt/hostedtoolcache/Python/3.10.15/x64/lib/python3.10/site-packages/IPython/core/async_helpers.py", line 128, in _pseudo_sync_runner
    coro.send(None)
  File "/opt/hostedtoolcache/Python/3.10.15/x64/lib/python3.10/site-packages/IPython/core/interactiveshell.py", line 3334, in run_cell_async
    has_raised = await self.run_ast_nodes(code_ast.body, cell_name,
  File "/opt/hostedtoolcache/Python/3.10.15/x64/lib/python3.10/site-packages/IPython/core/interactiveshell.py", line 3517, in run_ast_nodes
    if await self.run_code(code, result, async_=asy):
  File "/opt/hostedtoolcache/Python/3.10.15/x64/lib/python3.10/site-packages/IPython/core/interactiveshell.py", line 3577, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "/tmp/ipykernel_2791/1339021521.py", line 2, in <module>
    import msprime
  File "/opt/hostedtoolcache/Python/3.10.15/x64/lib/python3.10/site-packages/msprime/__init__.py", line 28, in <module>
    from msprime._msprime import (
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
AttributeError: _ARRAY_API not found
---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
Cell In[9], line 2
      1 import collections
----> 2 import msprime
      3 # Simulate a tree sequence with 2 diploid individuals (i.e. 4 samples)
      4 ts = msprime.sim_ancestry(2, sequence_length=1e8, recombination_rate=1e-7, random_seed=1)

File /opt/hostedtoolcache/Python/3.10.15/x64/lib/python3.10/site-packages/msprime/__init__.py:28
      1 # Turn off flake8 and reorder-python-imports for this file.
      2 # flake8: NOQA
      3 # noreorder
   (...)
     20 # along with msprime.  If not, see <http://www.gnu.org/licenses/>.
     21 #
     22 """
     23 Msprime is a population genetics simulator.
     24 Please see the documentation at https://tskit.dev/msprime/docs/
     25 for more information.
     26 """
---> 28 from msprime._msprime import (
     29     NODE_IS_CEN_EVENT,
     30     NODE_IS_CA_EVENT,
     31     NODE_IS_MIG_EVENT,
     32     NODE_IS_RE_EVENT,
     33     NODE_IS_GC_EVENT,
     34     NODE_IS_PASS_THROUGH,
     35 )
     37 from msprime.ancestry import (
     38     AncestryModel,
     39     BetaCoalescent,
   (...)
     50     NodeType,
     51 )
     53 from msprime.core import __version__

ImportError: numpy.core.multiarray failed to import

Enumerating Topologies#

This section expands briefly on the approach used to enumerate tree topologies that serves as the basis for Tree.rank() and Tree.unrank(). To enumerate all rooted, leaf-labelled tree topologies, we first formulate a system of ordering and enumerating tree shapes. Then we define an enumeration of labellings given an arbitrary tree shape.

Enumerating Tree Shapes#

Starting with \(n = 1\), we see that the only shape for a tree with a single leaf is a single root leaf. A tree with \(n > 1\) leaves can be obtained by joining at least two trees whose number of leaves sum to \(n\). This maps very closely to the concept of integer partitions. Each tree shape of \(n\) leaves can be represented by taking a nondecreasing integer partition of \(n\) (elements of the partition are sorted in nondecreasing order) and recursively partitioning its elements. The order in which we select partitions of \(n\) is determined by the efficient rule_asc algorithm for generating them.

All tree shapes with four leaves, and the partitions that generate them, are:

All four-leaf tree shapes and their generating partitions

Note that the middle column reflects all tree shapes of three leaves in the right subtree!

* This excludes the partition \([n]\), since this would create a unary node and trees with unary nodes are inumerable (and potentially infinite).

Note

Using nondecreasing integer partitions enforces a canonical orientation on the tree shapes, where children under a node are ordered by the number of leaves below them. This is important because it prevents us from repeating trees that are topologically the same but whose children are ordered differently.

Labelling Tree Shapes#

Tree shapes are useful in and of themselves, but we can use the enumeration formulated above to go further and assign labels to the leaves of each shape.

Say we are given a tree \(T\) with \(n\) leaves, whose left-most subtree, \(T_l\), has k leaves. For each of the \(n \choose k\) ways to select labels to assign to \(T_l\), we produce a unique labelling of \(T\). This process of choosing labels is repeated for the other children of \(T\) and then recursively for the subtrees.

Looking back to the example from Interpreting Tree Ranks, we can see the second tree shape, where the tree is a strictly bifucating tree of three leaves, can be labelled in 3 different unique ways.

second_tree = tskit.Tree.unrank(num_leaves=3, rank=(1, 0))
for t in tskit.all_tree_labellings(second_tree):
    display(SVG(t.draw_svg(node_labels={0: 0, 1: 1, 2: 2}, order="tree", size=(120, 120))))

The order of the tree labellings is a direct result of the way in which combinations of labels are chosen. The implementation in tskit uses a standard lexicographic ordering to choose labels. See how the trees are sorted by the order in which the left leaf’s label was chosen.

Note

There is a caveat here regarding symmetry, similar to that of repeating tree shapes. Symmetrical trees run the risk of creating redundant labellings if all combinations of labels were exhausted. To prevent redundant labellings we impose a canonical labelling. In the case of two symmetrical subtrees, the left subtree must receive the minimum label from the label set. Notice how this is the case in the right subtrees above.

These two enumerations create a complete ordering of topologies where trees are ordered first by size (number of leaves), then by shape, then by their minimum label. It is this canonical order that enables efficient ranking and unranking of topologies.