Completing forwards simulations
Completing forwards simulations#
msprime simulator generates tree sequences using the backwards in
time coalescent model. But it is also possible to output tree sequences
simulators such as SLiM
and fwdpy11 (see the
Building a forward simulator tutorial).
There are many advantages to using forward-time simulators, but they
are usually quite slow compared to similar coalescent simulations. In this
section we show how to combine the best of both approaches by simulating
the recent past using a forwards-time simulator and then complete the
simulation of the ancient past using
msprime. (We sometimes refer to this
“recapitation”, as we can think of it as adding a “head” onto a tree sequence.)
First, we define a simple Wright-Fisher simulator which returns a tree sequence with the properties that we require (please see the msprime documentation for a formal description of these properties):
import tskit import msprime import random import numpy as np def wright_fisher(N, T, L=100, random_seed=None): """ Simulate a Wright-Fisher population of N haploid individuals with L discrete loci for T generations. Based on Algorithm W from https://doi.org/10.1371/journal.pcbi.1006581 """ random.seed(random_seed) tables = tskit.TableCollection(L) tables.populations.add_row() P = np.arange(N, dtype=int) for _ in range(N): tables.nodes.add_row(time=T, flags=0, population=0) t = T while t > 0: t -= 1 Pp = P.copy() for j in range(N): u = tables.nodes.add_row(time=t, flags=0, population=0) Pp[j] = u a = random.randint(0, N - 1) b = random.randint(0, N - 1) x = random.randint(1, L - 1) tables.edges.add_row(0, x, P[a], u) tables.edges.add_row(x, L, P[b], u) P = Pp # Now do some table manipulations to ensure that the tree sequence # that we output has the form that msprime needs to finish the # simulation. (Note: table columns are not modified in place because the # tables API does not currently allow direct access to memory.) # Mark the extant population as samples. flags = tables.nodes.flags flags[P] = tskit.NODE_IS_SAMPLE tables.nodes.flags = flags tables.sort() # Simplify with respect to the current generation, but ensuring we keep the # ancient nodes from the initial population. tables.simplify(keep_input_roots=True) return tables.tree_sequence()
We then run a tiny forward simulation of 10 two-locus individuals for 5 generations, and print out the resulting trees:
num_loci = 2 N = 10 wf_ts = wright_fisher(N, 5, L=num_loci, random_seed=3) wf_ts.draw_svg()
Because our Wright Fisher simulation ran for only 5 generations, there has not been enough time for the trees to fully coalesce. Therefore, instead of having one root, the trees have several — the first tree has 2 and the second 4. Nodes 16, 17, 18, and 19 in this simulation represent the members of the initial population of the simulation that have genetic descendants at the end of the simulation. These unary branches joining samples and coalesced subtrees to the nodes in the initial generation are essential as they allow use to correctly assemble the various fragments of ancestral material into chromosomes when creating the initial conditions for the coalescent simulation. (Please see the msprime documentation for for more details on the required properties of input tree sequences.)
The process of completing this tree sequence using a coalescent simulation begins by first examining the root segments on the input trees. We get the following segments:
[(0, 2, 17), (0, 2, 18), (1, 2, 19), (1, 2, 16)]
where each segment is a
(left, right, node) tuple. As nodes 17 and 18 are
present in both trees, they have segments spanning both loci. Nodes 16 and 19 are
present only in the second tree, and so they have ancestral segments only for
the second locus. Note that this means that we do not simulate the ancestry
of the entire initial generation of the simulation, but rather the exact
minimum that we need in order to complete the ancestry of the current
generation. For instance, root
19 has not coalesced over the interval from
2.0, while root
17 has not coalesced over the entire segment
We run the coalescent simulation to complete this tree sequence using the
initial_state argument to
msprime.sim_ancestry(). Because we have simulated a
two locus system with a recombination rate of
1 / num_loci per generation
in the Wright-Fisher model, we want to use the same system in the coalescent simulation.
Note that we set the
ploidy argument to 1 here because our forward time simulation
is haploid and msprime uses a diploid time scale by default.
coalesced_ts = msprime.sim_ancestry( population_size=N, initial_state=wf_ts, recombination_rate=1 / num_loci, ploidy=1, random_seed=7) coalesced_ts.draw_svg()
/opt/hostedtoolcache/Python/3.8.14/x64/lib/python3.8/site-packages/msprime/ancestry.py:831: TimeUnitsMismatchWarning: The initial_state has time_units=unknown but time is measured in generations in msprime. This may lead to significant discrepancies between the timescales. If you wish to suppress this warning, you can use, e.g., warnings.simplefilter('ignore', msprime.TimeUnitsMismatchWarning) warnings.warn(message, TimeUnitsMismatchWarning) /opt/hostedtoolcache/Python/3.8.14/x64/lib/python3.8/site-packages/msprime/demography.py:1176: IncompletePopulationMetadataWarning: No metadata schema present in population table, not recording metadata warnings.warn(
The trees have fully coalesced and we’ve successfully combined a forwards-time Wright-Fisher simulation with a coalescent simulation: hooray!
Why keep input roots (i.e., the initial generation)?#
We can now see why it is essential that we take care to preserve the roots of all
trees when we simplified the tree sequence (by passing
so that the initial generation can be properly used as the
initial_state argument to
msprime.sim_ancestry(). In the example above, if node
18 was not in the tree sequence, we would not know that the segment that
10 inherits from on
[0.0, 1.0) and the segment that node
inherits from on
[1.0, 2.0) both exist in the same node.
Note that although the portions of intial generation (above, nodes
19) must be in the tree sequence, they do not have to be
samples, and their entire genomes need not be represented (e.g., node
only present on
[1.0, 2.0)). This allows
msprime.sim_ancestry() to not simulate
the entire history of the first generation, but only what is necessary to complete
any uncoalesced trees. Happily, this is easily done with the
simplify(). Note that this argument would need
to be provided to the periodic
which are essential in practical simulation, but that we skipped in the toy simulator above.
The trees that we output from this combined forwards and backwards simulation
process have some slightly odd properties that are important to be aware of.
In the example above, we can see that the old roots are still present in both trees,
even through they have only one child and are clearly redundant.
This is because the tables of
initial_state have been retained, without modification,
at the top of the tables of the output tree sequence. While this
redundancy is not important for many tasks, there are some cases where
they may cause problems:
When computing statistics on the number of nodes, edges or trees in a tree sequence, having these unary edges and redundant nodes will slightly inflate the values.
If you are computing the overall tree “height” by taking the time of the root node, you may overestimate the height because there is a unary edge above the “real” root (this would happen if one of the trees had already coalesced in the forwards-time simulation).
For these reasons it may be better to remove this redundancy from your
computed tree sequence which is easily done using the
final_ts = coalesced_ts.simplify() coalesced_ts.draw_svg()
This final tree sequence is topologically identical to the original tree sequence,
but has the redundant nodes and edges removed. Note also that the node IDs have been
reassigned so that the samples are 0 to 9 — if you need the IDs from the original
tree sequence, please set
map_nodes=True when calling
simplify to get a
mapping between the two sets of IDs.