Vignette: Parallelizing SLiM simulations in a phylogenetic tree#
Imagine you want to simulate the evolutionary history of a group. If there is no migration between any of the branches in your tree, any branches stemming from the same node can be simulated in parallel (see Fig. 7).
To do this, we’ll need to do two things: (1) be able to simulate branches in parallel, and (2) glue the resulting simulations (one per tip) back together.
Simulating the branches#
First, we need to write a SLiM script that will be used for simulating the history of each branch in our phylogeny. We will perform a simple simulation, in which each branch can have a different (but fixed) population size and length (number of ticks). Also, we will allow deleterious mutations to happen across the entire chromosome at a fixed rate.
Here is a SLiM script that would do this:
// The following constants need to be defined:
// - outfile: path to save the tree sequence of the simulation.
// - popsize: population size.
// - num_gens: run simulation for num_gens ticks.
// - infile: a string with path to tree sequence to start from; optional.
initialize()
{
initializeSLiMModelType("WF");
initializeTreeSeq();
initializeMutationRate(1e-8);
initializeMutationType("m1", 0.5, "f", -0.01);
initializeGenomicElementType("g1", m1, 0.1);
initializeGenomicElement(g1, 0, 1e6-1);
initializeRecombinationRate(1e-9);
}
1 late() {
// if no input tree sequence is provided, then start a subpopulation
if (infile == "") {
p = sim.addSubpop("p1", popsize);
} else {
// reloading must happen in late()
sim.readFromPopulationFile(infile);
parent = sim.subpopulations[0];
p = sim.addSubpopSplit(max(sim.subpopulations.id) + 1, popsize, parent);
parent.setSubpopulationSize(0);
}
p.name = popname;
}
// schedule the end of the simulation
1 late() {
finaltick = num_gens + community.tick;
community.rescheduleScriptBlock(s0, ticks=finaltick);
}
// event that saves the tree sequence
s0 1000 late() {
sim.treeSeqRememberIndividuals(sim.subpopulations.individuals);
sim.treeSeqOutput(outfile);
}
For each branch, the presence or absence of infile
tells SLiM
whether we want to start it from a previous branch or not.
If so, SLiM will read the previous tree sequence and change the
population size accordingly.
Note that when you read a tree sequence into SLiM, the tick counter will
be updated with the time encoded in the tree sequence, so we need to set the end
of the simulation as the length of the branch (num_gens
) plus the current
“time” at the end of the loaded tree sequence.
At the end of the simulation, we call sim.treeSeqRememberIndividuals
right
before saving the resulting tree sequence. This is necessary because we need to
ensure the individuals in the final generation are never dropped from the tree
sequence in future runs of SLiM which are started from the output of the
simulation, as they will later be used to glue the tree sequences together.
I encoded the phylogeny we will simulate in a simple table,
which we’ll use as df
in the code below:
Show code cell source
df = pd.read_csv("_static/phylo.tsv", sep="\t")
df = df.fillna('')
df["infile"] = df.parent + ".trees"
df["outfile"] = df.child + ".trees"
df.loc[df["infile"]==".trees", "infile"] = ""
df["is_leaf"] = ~df.child.isin(df.parent)
df
child | parent | popsize | edgelen | infile | outfile | is_leaf | |
---|---|---|---|---|---|---|---|
0 | root | 500 | 2000 | root.trees | False | ||
1 | C | root | 50 | 250 | root.trees | C.trees | True |
2 | I | root | 100 | 200 | root.trees | I.trees | False |
3 | B | I | 70 | 50 | I.trees | B.trees | True |
4 | A | I | 40 | 50 | I.trees | A.trees | True |
With our phylogeny and the simulation parameters, we are ready to run our
simulations.
One way to parallelize the simulation of sister branches is to use make
.
You do not need to know much about this tool (though it is totally worthwhile
to check it out).
The main idea here is that you can specify dependency between files and make
works its magic to run the simulations in the right order.
Here is python code that will write out a makefile from the information in df
:
f = open("sims.make", "w")
print(f"all: {' '.join(df.outfile.to_list())}\n", file=f)
for i, row in df.iterrows():
print(f"{row.outfile}: {row.infile} phylo_bgs.slim", file=f)
print(f"\tslim -d \"infile='{row.infile}'\" -d popsize={row.popsize} "
f"-d \"popname=\'{row.child}\'\" "
f"-d num_gens={row.edgelen} " f"-d \"outfile='{row.child}.trees'\" "
"phylo_bgs.slim\n",
file=f)
f.close()
Here’s the result. Again, don’t worry about the details,
but you can see that the file encodes the phylogeny
through a bunch of child : parent
“rules”:
Show code cell source
%%bash
cat sims.make
all: root.trees C.trees I.trees B.trees A.trees
root.trees: phylo_bgs.slim
slim -d "infile=''" -d popsize=500 -d "popname='root'" -d num_gens=2000 -d "outfile='root.trees'" phylo_bgs.slim
C.trees: root.trees phylo_bgs.slim
slim -d "infile='root.trees'" -d popsize=50 -d "popname='C'" -d num_gens=250 -d "outfile='C.trees'" phylo_bgs.slim
I.trees: root.trees phylo_bgs.slim
slim -d "infile='root.trees'" -d popsize=100 -d "popname='I'" -d num_gens=200 -d "outfile='I.trees'" phylo_bgs.slim
B.trees: I.trees phylo_bgs.slim
slim -d "infile='I.trees'" -d popsize=70 -d "popname='B'" -d num_gens=50 -d "outfile='B.trees'" phylo_bgs.slim
A.trees: I.trees phylo_bgs.slim
slim -d "infile='I.trees'" -d popsize=40 -d "popname='A'" -d num_gens=50 -d "outfile='A.trees'" phylo_bgs.slim
With the makefile in hand,
we can now run make, specifying the maximum number of simulations
to be run simultaneously the -j
.
(Click on the “+” icon to see SLiM’s output.)
%%bash
make -f sims.make -j 3
Show code cell output
make[1]: Entering directory '/home/runner/work/tskit-site/tskit-site/docs'
make[1]: Nothing to be done for 'all'.
make[1]: Leaving directory '/home/runner/work/tskit-site/tskit-site/docs'
Click here for how to use python instead of make
You would have to write a recursion over the branches in your tree (starting from the root) and then parallelize the runs of sister branches somehow.
def phylo_recursion(parent, df):
print(parent)
childs = df[df.parent==parent]
print(childs)
if len(childs) == 0:
return
# you could parallelize this loop over childs with same parent
for i, row in childs.iterrows():
if not os.path.exists(row.outfile):
os.system(f"slim -d \"infile='{row.infile}'\" -d popsize={row.popsize} -d num_gens={row.edgelen} -d \"outfile='{row.child}.trees'\" phylo_bgs.slim")
phylo_recursion(row.child, df)
phylo_recursion("", df)
Putting it all together: unioning the tree sequences#
With the tree sequences in hand, we now need to glue them together. This can be done using union from tskit. For two tree sequences which share some of its past history is shared, union works by copying the non-shared parts of one of the tree sequence onto the other. The trickiest part of this operation is defining the parts that are equivalent in the two tree sequences. For that, you will have to create an array that serves as a map of node IDs between the two tree sequences.
Here is a function that will construct a map of the node IDs of two SLiM tree sequences
that correspond to the same chromosomes in SLiM
at any time older than the given time ago at which the two populations split.
Given two tree sequences other
and ts
,
the goal here is to find,
for each node born before split_time
ago in other
,
the matching node in ts
, where we can identify matching using the SLiM ID in metadata.
The code could be made easier to read by iterating over nodes,
but the following numpy-based version is much faster:
def match_nodes(other, ts, split_time):
"""
Given SLiM tree sequences `other` and `ts`, builds a numpy array with length
`other.num_nodes` in which the indexes represent the node id in `other` and the
entries represent the equivalent node id in `ts`. If a node in `other` has no
equivalent in `ts`, then the entry takes the value `tskit.NULL` (-1). The
matching is done by comparing the IDs assigned by SLiM which are kept in
node metadata. This matching of SLiM IDs is *only* done for nodes with time
older than the specified `split_time`.
"""
node_mapping = np.full(other.num_nodes, tskit.NULL)
sids0 = np.array([n.metadata["slim_id"] for n in ts.nodes()])
sids1 = np.array([n.metadata["slim_id"] for n in other.nodes()])
alive_before_split1 = (other.tables.nodes.time >= split_time)
is_1in0 = np.isin(sids1, sids0)
both = np.logical_and(alive_before_split1, is_1in0)
sorted_ids0 = np.argsort(sids0)
matches = np.searchsorted(
sids0,
sids1[both],
side='left',
sorter=sorted_ids0
)
node_mapping[both] = sorted_ids0[matches]
return node_mapping
Now we are finally ready to union our tree sequences. For that, I wrote a recursive function that goes through our data frame with the phylogeny and returns a dictionary with the merged tree sequences from the tip to the root.
merged = {
row.child : {
"ts": tskit.load(row.outfile),
"depth": row.edgelen,
"children": [row.child]
}
for i, row in df[df.is_leaf].iterrows()
}
def union_children(parent, df, merged):
print(f"Going in: {parent}")
child_rows = df[df.parent == parent]
assert (len(child_rows) == 2) or (len(childs) == 0)
if len(child_rows) == 2:
children = [row.child for _, row in child_rows.iterrows()]
for child in children:
if child not in merged:
union_children(child, df, merged)
split_time = merged[children[0]]["depth"]
assert split_time == merged[children[1]]["depth"] # ultrametric
print(f'Unioning: {children}, Split time: {split_time}')
ts0 = merged[children[0]]["ts"]
ts1 = merged[children[1]]["ts"]
node_map = match_nodes(ts1, ts0, split_time)
tsu = ts0.union(ts1, node_map, check_shared_equality=True)
# the time from tip to start of simulation is split_time plus the
# length of the edge
parent_edgelength = df[df.child==parent].edgelen.item()
merged[parent] = {
"ts": tsu,
"depth": split_time + parent_edgelength,
"children": merged[children[0]]["children"] + merged[children[1]]["children"]
}
union_children("root", df, merged)
# union of all three species tree sequences is in the root.
tsu = merged["root"]["ts"]
pops = merged["root"]["children"]
Going in: root
Going in: I
Unioning: ['B', 'A'], Split time: 50
Unioning: ['C', 'I'], Split time: 250
A slightly tricky thing we had to do there was to make sure we kept track of which population in the union’ed tree sequence corresponds to which population in our phylogeny. Happily, we’ve stored each population’s name in its metadata field, so it’s easy to match populations in the tree sequence up to what they’re supposed to be.
Let’s make sure we have the right number of present-day samples in each of the populations. To do this we need to make sure to get “alive” samples, because recall that we have saved the state of the population at each species split time.
alive = np.where(np.isclose(tsu.tables.nodes.time, 0))[0]
pop_ids = {}
for pop in tsu.populations():
if pop.metadata is not None:
pop_ids[pop.metadata['name']] = pop.id
for name in pops:
pop_samples = tsu.samples(pop_ids[name])
n_samples = sum(np.isin(pop_samples, alive)) // 2
print(f"Union-ed tree sequence has {n_samples} samples in population {name},\n"
f"\tand we specified {df[df.child==name].popsize.item()} individuals in our simulations.")
assert n_samples == df[df.child==name].popsize.item()
Union-ed tree sequence has 50 samples in population C,
and we specified 50 individuals in our simulations.
Union-ed tree sequence has 70 samples in population B,
and we specified 70 individuals in our simulations.
Union-ed tree sequence has 40 samples in population A,
and we specified 40 individuals in our simulations.
Let’s do an additional consistency check now, to see if we need to recapitate (i.e., if some trees haven’t coalesced), and to make sure that all roots are in the root population, as they should be:
# TODO: fix up
# for t in tsu.trees():
# for r in t.roots:
# assert tsu.node(r).population == pop_ids["root"]
print(f"Max number of roots: {max([t.num_roots for t in tsu.trees()])}.")
Max number of roots: 1.
Finally, we will recapitate the result with a small population size of 100, in case some trees on the root branch haven’t coalesced, and write out the result:
tsu = pyslim.recapitate(tsu, recombination_rate=1e-8, ancestral_Ne=100)
tsu.dump("final.trees")
/opt/hostedtoolcache/Python/3.10.15/x64/lib/python3.10/site-packages/msprime/ancestry.py:831: TimeUnitsMismatchWarning: The initial_state has time_units=ticks but time is measured in generations in msprime. This may lead to significant discrepancies between the timescales. If you wish to suppress this warning, you can use, e.g., warnings.simplefilter('ignore', msprime.TimeUnitsMismatchWarning)
warnings.warn(message, TimeUnitsMismatchWarning)
Now we’re done, and can analyse the final tree sequence! Just for fun, I’ll look at the trees produced by the simulation. For instance, we might be curious how often there are disagreements between the species tree and the simulated gene trees (also called incomplete lineage sorting, or ILS).
To make it possible to look at the trees, I will first simplify the union-ed tree sequence to keep only two diploid samples per population.
rng = np.random.default_rng(seed=123)
ind_alive = pyslim.individuals_alive_at(tsu, 0)
# TODO: this will work in the next tskit
# ind_pops = tsu.individuals_population[ind_alive]
ind_pops = np.array([tsu.node(tsu.individual(i).nodes[0]).population for i in ind_alive])
subsample_indivs = [
rng.choice(ind_alive[ind_pops == pop_ids[name]], 2)
for name in pops
]
subsample_nodes = [
np.concatenate([tsu.individual(i).nodes for i in x])
for x in subsample_indivs
]
tsus = tsu.simplify(
np.concatenate(subsample_nodes),
filter_populations=False,
)
pop_labels = {v: k for k, v in pop_ids.items()}
SVG(tsus.draw_svg(
node_labels={
node.id: pop_labels[node.population]
for node in tsus.nodes()
if not node.time > 0.0
},
x_lim=[0,2200],
size=(800, 300),
))
Note
A possible gotcha in the code above lies in getting the time units to work out.
Note that in the SLiM script we both save and reload .trees files in the
late()
stage of the SLiM life cycle. This is important: if we had reloaded the
files in early()
, then each time we did so the “tskit time” and “SLiM time”
would become one step out of sync. This leads to errors either in union (since
if the time units in the two tree sequences do not match, union will raise an error)
or in recapitate (since recapitate assumes that the “top” of the trees are at
the number of generations ago recorded by SLiM in metadata).