Ancestry simulations#
Msprime simulates the ancestry of sampled genomes under a number of different backwards-in-time population genetic models. The simulated ancestry is represented as a succinct tree sequence using the tskit library, which provides an extensive suite of operations for analysing genealogical trees and DNA sequence data. See the Getting started with tskit tutorial for more information on how to process data using tskit.
Here we run a simple simulation and show a summary of the resulting tree sequence:
ts = msprime.sim_ancestry(2)
ts
Tree Sequence | |
---|---|
Trees | 1 |
Sequence Length | 1.0 |
Time Units | generations |
Sample Nodes | 4 |
Total Size | 1.8 KiB |
Metadata | No Metadata |
Table | Rows | Size | Has Metadata |
---|---|---|---|
Edges | 6 | 200 Bytes | |
Individuals | 2 | 80 Bytes | |
Migrations | 0 | 8 Bytes | |
Mutations | 0 | 16 Bytes | |
Nodes | 7 | 204 Bytes | |
Populations | 1 | 224 Bytes | ✅ |
Provenances | 1 | 1011 Bytes | |
Sites | 0 | 16 Bytes |
The sim_ancestry()
function has many parameters to specify
the details of the simulations. The API documentation
contains a precise description of the parameter; the sections
in this page provide some explanation and examples of how to use these
parameters.
Note
It’s important to note that sim_ancestry()
only simulates
the ancestral trees for samples: if we want actual sequence data
then we also need to simulate mutations on these trees.
See the Mutation simulations section for more information on how to do
this.
Quick reference#
sim_ancestry()
Simulate ancestral tree sequence topology
Models
StandardCoalescent
Coalescent with recombination (“hudson”)
SmcApproxCoalescent
Sequentially Markov Coalescent (“smc”)
SmcPrimeApproxCoalescent
SMC’(“smc_prime”)
DiscreteTimeWrightFisher
Generation-by-generation Wright-Fisher
FixedPedigree
Simulated conditioned on a given pedigree
BetaCoalescent
Beta coalescent multiple-merger
DiracCoalescent
Dirac coalescent multiple-merger
SweepGenicSelection
Selective sweep at a linked locus
Specifying samples#
The samples
argument to sim_ancestry()
defines the number
of sampled individuals whose history will be simulated. There are three different
forms; the samples
argument can be:
an integer, interpreted as the number of sampled individuals to draw in a single population model;
a dictionary mapping population references (either integer IDs or names) to the number of sampled individuals for that population;
a list of
SampleSet
objects, which provide more flexibility in how groups of similar samples are drawn from populations.
In the simplest case, provide a single integer which defines the number of samples:
ts = msprime.sim_ancestry(2, random_seed=1234)
ts
Tree Sequence | |
---|---|
Trees | 1 |
Sequence Length | 1.0 |
Time Units | generations |
Sample Nodes | 4 |
Total Size | 1.8 KiB |
Metadata | No Metadata |
Table | Rows | Size | Has Metadata |
---|---|---|---|
Edges | 6 | 200 Bytes | |
Individuals | 2 | 80 Bytes | |
Migrations | 0 | 8 Bytes | |
Mutations | 0 | 16 Bytes | |
Nodes | 7 | 204 Bytes | |
Populations | 1 | 224 Bytes | ✅ |
Provenances | 1 | 1005 Bytes | |
Sites | 0 | 16 Bytes |
Here we specify two sample individuals, which at the default ploidy of two, gives us four sample nodes.
Warning
It is important to note that the number of samples refers to the number of individuals not the number of nodes (monoploid genomes). See the Ploidy section for details.
This integer form for the samples
argument can only be used
when the demography
is a single population model.
Populations#
The next example illustrates one usage of the dictionary form of the samples
argument, in which the keys refer to populations in the
demographic model and
the values are the number of samples to draw.
(See the Demography section for more details on
the demography
argument to sim_ancestry()
.)
We first create a Demography
object representing
a 10 deme linear stepping stone model. Then, we run the simulation
with 3 diploid samples each drawn from the first and last demes in this
linear habitat.
N = 10
demography = msprime.Demography.stepping_stone_model(
[100] * N,
migration_rate=0.1,
boundaries=True)
ts = msprime.sim_ancestry({0: 3, N - 1: 3}, demography=demography)
ts
Tree Sequence | |
---|---|
Trees | 1 |
Sequence Length | 1.0 |
Time Units | generations |
Sample Nodes | 12 |
Total Size | 5.6 KiB |
Metadata | No Metadata |
Table | Rows | Size | Has Metadata |
---|---|---|---|
Edges | 22 | 712 Bytes | |
Individuals | 6 | 192 Bytes | |
Migrations | 0 | 8 Bytes | |
Mutations | 0 | 16 Bytes | |
Nodes | 23 | 652 Bytes | |
Populations | 10 | 593 Bytes | ✅ |
Provenances | 1 | 3.3 KiB | |
Sites | 0 | 16 Bytes |
The keys in the dictionary can also be the string names of the
population (see the Identifying populations section),
which is useful when we are simulating from empirically
estimated models. For example, here create a Demography
object
based on a species tree, and then draw samples using the species names.
demography = msprime.Demography.from_species_tree(
"(((human:5.6,chimpanzee:5.6):3.0,gorilla:8.6):9.4,orangutan:18.0)",
time_units="myr",
initial_size=10**4,
generation_time=20)
ts = msprime.sim_ancestry({"gorilla": 2, "human": 4}, demography=demography)
ts
Tree Sequence | |
---|---|
Trees | 1 |
Sequence Length | 1.0 |
Time Units | generations |
Sample Nodes | 12 |
Total Size | 5.0 KiB |
Metadata | No Metadata |
Table | Rows | Size | Has Metadata |
---|---|---|---|
Edges | 22 | 712 Bytes | |
Individuals | 6 | 192 Bytes | |
Migrations | 0 | 8 Bytes | |
Mutations | 0 | 16 Bytes | |
Nodes | 23 | 652 Bytes | |
Populations | 7 | 481 Bytes | ✅ |
Provenances | 1 | 2.8 KiB | |
Sites | 0 | 16 Bytes |
When samples are drawn using this dictionary form it is always
at the population’s Default sampling time.
If you wish to draw samples at a different time, then you must use the
more general form of a list of SampleSet
objects.
See also
See the Demography section for more information on how to specify demographic models in a simulation.
Sampling time#
By default the samples that we draw from a Population
are
at the population’s default_sampling_time
.
See the Default sampling time
section for more information about how this is
set for populations in a demographic model.
We can manually control the time at which samples
are drawn using list of SampleSet
objects form for the
samples argument.
In this example we create two diploid sample individuals, one at the present time and one taken 50 generations ago, representing one modern and one ancient individual. By default, when we draw the trees, the nodes that belong to samples are drawn as squares whereas nodes that are not samples are drawn as circles.
ts = msprime.sim_ancestry(
samples = [
msprime.SampleSet(1),
msprime.SampleSet(1, time=50)
],
population_size=100,
random_seed=42
)
SVG(ts.draw_svg(y_axis=True))
See also
See the Order of event execution section for details of the relative order in which sampling events occur.
Sample details#
Sample individuals and nodes are allocated sequentially in the order that they are specified. For example:
demography = msprime.Demography.island_model([10, 10], migration_rate=1)
ts = msprime.sim_ancestry(
samples=[
msprime.SampleSet(1, population=1),
msprime.SampleSet(2, population=0)],
demography=demography,
end_time=0)
print(ts.tables.individuals)
print(ts.tables.nodes)
╔══╤═════╤════════╤═══════╤════════╗
║id│flags│location│parents│metadata║
╠══╪═════╪════════╪═══════╪════════╣
║0 │ 0│ │ │ ║
║1 │ 0│ │ │ ║
║2 │ 0│ │ │ ║
╚══╧═════╧════════╧═══════╧════════╝
╔══╤═════╤══════════╤══════════╤════╤════════╗
║id│flags│population│individual│time│metadata║
╠══╪═════╪══════════╪══════════╪════╪════════╣
║0 │ 1│ 1│ 0│ 0│ ║
║1 │ 1│ 1│ 0│ 0│ ║
║2 │ 1│ 0│ 1│ 0│ ║
║3 │ 1│ 0│ 1│ 0│ ║
║4 │ 1│ 0│ 2│ 0│ ║
║5 │ 1│ 0│ 2│ 0│ ║
╚══╧═════╧══════════╧══════════╧════╧════════╝
(Because we’re only interested in the sampled nodes and individuals we
stopped the simulation from actually doing anything by setting
end_time=0
— see the Setting the start time section
for more information.) Here we define three sample individuals,
and we therefore have three rows in the individual table.
Because these are diploid individuals, the node table contains
six sample nodes. If we look at the individual
column in the
node table we can see that the first two nodes correspond to individual
0
, the next two nodes individual 1
, etc. The sample configuration
stated that the first sample should come from population 1
and
the other two from population 0
, and we can see this reflected
in the population
column of the node table. (Somewhat confusingly,
population values are associated with nodes rather than individuals;
this is mostly for historical reasons.)
The SampleSet
class has a number of attributes which default
to None
. If these are set they will override the values
which might be specified elsewhere. For example, we can specify
mixed ploidy samples via the ploidy
attribute of SampleSet
:
ts = msprime.sim_ancestry(
samples=[
msprime.SampleSet(1, ploidy=3),
msprime.SampleSet(2)],
ploidy=1,
end_time=0)
print(ts.tables.individuals)
print(ts.tables.nodes)
╔══╤═════╤════════╤═══════╤════════╗
║id│flags│location│parents│metadata║
╠══╪═════╪════════╪═══════╪════════╣
║0 │ 0│ │ │ ║
║1 │ 0│ │ │ ║
║2 │ 0│ │ │ ║
╚══╧═════╧════════╧═══════╧════════╝
╔══╤═════╤══════════╤══════════╤════╤════════╗
║id│flags│population│individual│time│metadata║
╠══╪═════╪══════════╪══════════╪════╪════════╣
║0 │ 1│ 0│ 0│ 0│ ║
║1 │ 1│ 0│ 0│ 0│ ║
║2 │ 1│ 0│ 0│ 0│ ║
║3 │ 1│ 0│ 1│ 0│ ║
║4 │ 1│ 0│ 2│ 0│ ║
╚══╧═════╧══════════╧══════════╧════╧════════╝
(Again, we stop the simulation immediately because we’re only interested in the initial samples.) Here we have three sampled individuals again but in this they are mixed ploidy: the first individual is triploid and the other two are haploid.
Warning
It is vital to note that setting the ploidy
value
of SampleSet
objects only affects sampling and does
not affect the actual simulation. In this case, the simulation
will be run on the haploid time scale. See the
Ploidy section for more details.
If you wish to set up the node and individual IDs in some other way,
it is possible to do so by using the initial_state
parameter.
See the Specifying the initial state for more information
on how to use this (advanced) feature.
Demography#
A demographic model is a description of a
set of populations, the migration rates between then and the
events that modify the populations over time. Ancestry simulations
can take one of these models as input via the demography
parameter to sim_ancestry()
.
See also
Please see the Demographic models section for details on how to create and debug demographic models, and the Specifying samples section for help on how to define the locations and times of sampled individuals.
We don’t need to explicitly specify a Demography
for every ancestry simulation, however. By default, we assume a
single fixed-size population. For the default
standard coalescent ancestry
model we assume a population size of 1, so that time is measured
in “coalescent units”. This is useful for theoretical work where
we usually work in scaled coalescent time.
Important
By default msprime uses the diploid coalescent time scale; see the Ploidy section for more details.
Outside of theoretical analysis, working with scaled
coalescent time can be confusing and it’s usually better to
explicitly set the population size so that the output
times are directly interpretable in generations.
The population_size
parameter to sim_ancestry()
allows us to set the size of the single constant
sized population in the default demography.
For example, here we run the same simulation twice, one with the default population size and one with a population size of 100 (this is the diploid population size because the default Ploidy is 2):
ts = msprime.sim_ancestry(2, random_seed=1234)
SVG(ts.draw_svg(y_axis=True))
ts = msprime.sim_ancestry(2, population_size=100, random_seed=1234)
SVG(ts.draw_svg(y_axis=True))
We can see that the results are identical, except the node times have been scaled by the population size.
The Discrete Time Wright-Fisher model is different. Because it explicitly works with a population of N individuals we get an error if we don’t specify a population size:
msprime.sim_ancestry(2, model="dtwf", random_seed=1234)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[11], line 1
----> 1 msprime.sim_ancestry(2, model="dtwf", random_seed=1234)
File ~/work/tskit-site/tskit-site/msprime/ancestry.py:1290, in sim_ancestry(samples, demography, sequence_length, discrete_genome, recombination_rate, gene_conversion_rate, gene_conversion_tract_length, population_size, ploidy, model, initial_state, start_time, end_time, record_migrations, record_full_arg, additional_nodes, coalescing_segments_only, num_labels, random_seed, num_replicates, replicate_index, record_provenance)
1265 parameters = dict(
1266 command="sim_ancestry",
1267 samples=samples,
(...)
1287 # replicate index is excluded as it is inserted for each replicate
1288 )
1289 provenance_dict = provenance.get_provenance_dict(parameters)
-> 1290 sim = _parse_sim_ancestry(
1291 samples=samples,
1292 sequence_length=sequence_length,
1293 recombination_rate=recombination_rate,
1294 gene_conversion_rate=gene_conversion_rate,
1295 gene_conversion_tract_length=gene_conversion_tract_length,
1296 discrete_genome=discrete_genome,
1297 population_size=population_size,
1298 demography=demography,
1299 ploidy=ploidy,
1300 model=model,
1301 initial_state=initial_state,
1302 start_time=start_time,
1303 end_time=end_time,
1304 record_migrations=record_migrations,
1305 record_full_arg=record_full_arg,
1306 additional_nodes=additional_nodes,
1307 coalescing_segments_only=coalescing_segments_only,
1308 num_labels=num_labels,
1309 random_seed=random_seed,
1310 )
1311 return _wrap_replicates(
1312 sim,
1313 num_replicates=num_replicates,
1314 replicate_index=replicate_index,
1315 provenance_dict=provenance_dict,
1316 )
File ~/work/tskit-site/tskit-site/msprime/ancestry.py:1009, in _parse_sim_ancestry(samples, sequence_length, recombination_rate, gene_conversion_rate, gene_conversion_tract_length, discrete_genome, population_size, demography, ploidy, model, initial_state, start_time, end_time, record_migrations, record_full_arg, additional_nodes, coalescing_segments_only, num_labels, random_seed, init_for_debugger)
1005 if is_dtwf:
1006 # A default size of 1 isn't so smart for DTWF and almost certainly
1007 # an error.
1008 if population_size is None:
-> 1009 raise ValueError(
1010 "When using the DTWF model, the population size must be set "
1011 "explicitly, either using the population_size or demography "
1012 "arguments."
1013 )
1014 if initial_state is not None:
1015 if population_size is None:
ValueError: When using the DTWF model, the population size must be set explicitly, either using the population_size or demography arguments.
Note
Because the DTWF models populations of individuals rather than using continuous time approximations, simulations with different population sizes will yield entirely different results; we are not simply rescaling time by using different population sizes. Furthermore, simulations with large population sizes will take relatively more CPU time and memory.
Ploidy#
The ploidy
argument to sim_ancestry()
has two effects:
It sets the default ploidy, that is, the number of nodes per sample individual.
For continuous time coalescent models, it sets the time scale.
Both of these are somewhat confusing, so let’s look at them one at a time.
Default number of nodes per sample individual#
As discussed in the Sample details section, the number of nodes (i.e., monoploid genomes) that we allocate per sample individual is determined by the ploidy. Here, we specify 2 sample individuals with a ploidy of 3:
ts = msprime.sim_ancestry(samples=2, ploidy=3, end_time=0)
print(ts.tables.individuals)
print(ts.tables.nodes)
╔══╤═════╤════════╤═══════╤════════╗
║id│flags│location│parents│metadata║
╠══╪═════╪════════╪═══════╪════════╣
║0 │ 0│ │ │ ║
║1 │ 0│ │ │ ║
╚══╧═════╧════════╧═══════╧════════╝
╔══╤═════╤══════════╤══════════╤════╤════════╗
║id│flags│population│individual│time│metadata║
╠══╪═════╪══════════╪══════════╪════╪════════╣
║0 │ 1│ 0│ 0│ 0│ ║
║1 │ 1│ 0│ 0│ 0│ ║
║2 │ 1│ 0│ 0│ 0│ ║
║3 │ 1│ 0│ 1│ 0│ ║
║4 │ 1│ 0│ 1│ 0│ ║
║5 │ 1│ 0│ 1│ 0│ ║
╚══╧═════╧══════════╧══════════╧════╧════════╝
Note
Because we’re only interested in the initial conditions here we
set the end_time
to zero: see the Setting the start time
section for more details.
Our individual table has two rows (the number of samples
)
and node table has 6 rows: three for each of the sampled
individuals.
Note that this is the default, and can be overridden by the
SampleSet.ploidy
value, as described in the
Sample details section.
Coalescent time scales#
The ploidy argument also
affects the time scale on which coalescence occurs,
since it affects the total number of genomes in the population
(see also population size).
For example, consider the following two simulations, which
are identical except for different ploidy
values
(note we use the SampleSet
object to make
sure we allocate the same number of nodes, as discussed
in the Default number of nodes per sample individual section):
ts = msprime.sim_ancestry(
samples=[msprime.SampleSet(4, ploidy=1)],
population_size=1,
ploidy=1,
random_seed=1234
)
SVG(ts.draw_svg(y_axis=True))
ts = msprime.sim_ancestry(
samples=[msprime.SampleSet(4, ploidy=1)],
population_size=1,
ploidy=2,
random_seed=1234
)
SVG(ts.draw_svg(y_axis=True))
The resulting tree sequences are identical, except that node time differ by a factor of 2, because in the underlying coalescent algorithm the total number of genomes in the population acts as a time scaling.
Todo
Add a discussion here about the Lambda coalescents an how things are more complicated for them.
Important
The Discrete Time Wright-Fisher model is an explicitly diploid model
and it is therefore
an error to run it with any ploidy
value other than 2.
Genome properties#
Sequence length#
There are a number of ways to specify the length of the chromosome that we want to simulate. In the absence of recombination and gene conversion, we assume a genome of length 1:
ts = msprime.sim_ancestry(2, random_seed=1)
SVG(ts.draw_svg())
The sequence_length
parameter determines the length of the
simulated sequence:
ts = msprime.sim_ancestry(2, sequence_length=100, random_seed=1)
SVG(ts.draw_svg())
By default there is no recombination or gene conversion, and so we therefore still have only one tree defining the ancestry of the samples. The sequence length makes little difference if we are considering only the ancestry of a non-recombining sequence, however, this will be important if we are interested in adding mutations later.
Recombination#
As we trace the ancestry of a sample of genomes backwards in time,
recombination has the effect of creating segments of genome
that have different (but highly correlated) genealogical trees.
The sim_ancestry()
function has a recombination_rate
argument that allows us to flexibly specify both uniform and
non-uniform rates of recombination along the genome.
The simplest way to run simulations with recombination is to
use the recombination_rate
and sequence_length
parameters:
ts = msprime.sim_ancestry(
3, recombination_rate=0.1, sequence_length=10, random_seed=2)
SVG(ts.draw_svg())
Here, we have four distinct trees along the genome, created by recombination events.
We can also provide a RateMap
argument to recombination_rate
:
rate_map = msprime.RateMap.uniform(sequence_length=10, rate=0.1)
other_ts = msprime.sim_ancestry(3, recombination_rate=rate_map, random_seed=2)
assert ts.equals(other_ts, ignore_provenance=True)
There are two things to note here:
We didn’t need to specify a
sequence_length
argument tosim_ancestry()
, as this information is encapsulated by therate_map
instance. (We can provide a value forsequence_length
if we wish, but it must be equal to thesequence_length
property of theRateMap
).The result of the simulation for a Uniform rate maps rate map with the same rate and
sequence_length
is identical to the first simulation above.
We can simulate non-uniform recombination by defining our RateMap
appropriately. For example, here we define a map in which the
rate over the first half of the sequence is one tenth of the second half:
rate_map = msprime.RateMap(position=[0, 10, 20], rate=[0.01, 0.1])
ts = msprime.sim_ancestry(3, recombination_rate=rate_map, random_seed=2)
SVG(ts.draw_svg())
Because the recombination rate in from position 10 to 20 is so much higher than the rate from 0 to 10, we can see that all the recombinations have fallen in the high recombination region.
See also
See the Creating rate maps section for more examples of creating
RateMap
instances.See the
RateMap.read_hapmap()
method for reading in genetic maps from text files.
Gene conversion#
Gene conversion events are defined by two parameters: the rate at which gene
conversion events are initiated and the distribution of tract lengths.
In the default case of discrete genome coordinates, tract lengths are drawn
from a geometric distribution with mean gene_conversion_tract_length
(which
must be at least 1). Note that if we specify a tract length of 1, then all
gene conversion tracts will have length exactly 1.
In the following example one gene conversion event of length 1 has occurred:
ts = msprime.sim_ancestry(
3, gene_conversion_rate=0.02, gene_conversion_tract_length=1,
sequence_length=10, random_seed=3)
SVG(ts.draw_svg())
Note
We can see the specific effect of gene conversion here in this simple example, because the trees at either side of the event are identical. However, this will not be true in general.
Continuous genomes can also be used. In this case the parameters define the rate at which gene conversion events are initiated per unit of sequence length and the mean of the exponentially distributed gene conversion tract lengths. The following example shows the same simulation as above but for a continuous genome:
ts = msprime.sim_ancestry(
3, gene_conversion_rate=0.02, gene_conversion_tract_length=1,
sequence_length=10, random_seed=3, discrete_genome=False)
SVG(ts.draw_svg())
Recombination and gene conversion at constant rates can be simulated alongside. In the following example recombinations at site 60 and 97 have occurred in addition to a gene conversion event covering the tract from site 76 to site 80.
ts = msprime.sim_ancestry(
3, sequence_length = 100, recombination_rate=0.003,
gene_conversion_rate=0.002, gene_conversion_tract_length=5,
random_seed=6)
SVG(ts.draw_svg())
Variable recombination rates and constant gene conversion rates can also be combined.
Note
Variable rates of gene conversion breakpoint initiation are not currently supported, but are planned for a future release. If you are interested in this feature please let us know!
Discrete or continuous?#
By default, we assume that the genome we are simulating is discrete so that genome coordinates are at integer positions:
ts = msprime.sim_ancestry(
3, recombination_rate=0.1, sequence_length=10, random_seed=2)
SVG(ts.draw_svg())
Alternatively, we can simulate a continous genome by setting
discrete_genome=False
:
ts = msprime.sim_ancestry(
3, recombination_rate=0.25, sequence_length=1, discrete_genome=False,
random_seed=33)
SVG(ts.draw_svg())
Here we see that the breakpoints along the genome occur at floating point positions. Simulating a continuous genome sequence can be useful for theoretical work, but we recommend using discrete coordinates for most purposes.
Multiple chromosomes#
Do I need to do this?#
The main purpose of simulating multiple chromosomes together (instead of running independent simulations for each chromosome) is to capture correlations between trees on distinct chromosomes. In many scenarios, simulating multiple chromosomes may not be necessary, and independent simulations may be preferable as they will be much more efficient.
That being said, this section describes how to simulate data over multiple chromosomes, or more generally, over multiple regions with free recombination between them.
Using a fixed pedigree#
The most natural way to simulate multiple regions with free recombination for a set of sampled individuals is by simulating through a known pedigree. The fixed pedigree simulation model constrains simulated genomic regions by an input pedigree, which specifies parent-offspring relationships. Since autosomal genetic material in an offspring has equal chance of being inherited from either their mother or their father, independent simulations through the same pedigree ensures that those regions are inherited from different parents or from the same parent each with probability 1/2. This ensures free recombination between independent simulations.
We often do not have a full pedigree, but instead have a pedigree that extends some number of generations into the past. Because a recombination fraction of 1/2 between regions causes correlations between chromosomes to decay rapidly, even shallow pedigrees containing just a handful of generations should broadly capture expected levels of correlations. At the termination of the pedigree, the simulation can be completed using the DTWF, Hudson, or other model. See example 1 below for a small example of this procedure, and more details on completing pedigree simulations can be found in the Fixed pedigree section.
Why might simulating multiple chromosomes using a pedigree be preferable to other simulation options? First, there is no need to concatenate recombination maps for multiple chromosomes, as each region is simulated independently. This also results in simpler, reusable code which can be parallelized. Finally, this approach improves efficiency over a single larger simulation of multiple concatenated chromosomes, as computational costs scale as the square of the sequence length.
Simulations without a fixed pedigree#
Without a pedigree, we can use other simulation models in msprime such as DTWF and the multiple merger coalescent. These models do not directly support simulating multiple chromosomes simultaneously, but we can emulate it using a single linear genome split into chromosome segments. In this case, multiple chromosomes are modelled by specifying a recombination map with single base-pair segments with recombination probability 1/2 separating adjacent chromosomes. The probability that two chromosomes are co-inherited across \(t\) generations is \(2^{-t}\). In the Poisson model of recombination used by msprime, the probability of co-inheritance across \(t\) generations of two adjacent base pairs separated by a recombination rate \(r\) per base pair is \(e^{-rt}\). To make these agree, we set the recombination rate in the base pair segments separating chromosomes to \(\log(2)\).
Nelson et al. 2020 describes a few cases where simulating multiple chromosomes using the DTWF is important for accurately modeling shared patterns of variation between samples. In large simulated cohorts, where many samples can be close relatives, we need the DTWF model and multiple chromosomes to obtain the correct distribution of the number and total length of shared IBD segments. We similarly require multiple chromosomes and the DTWF model to get the correct ancestry patterns for very recently admixed populations.
Additionally, Multiple merger coalescents result in positively correlated ancestries between unlinked chromosomes (see Birkner et al. 2013). This correlation does not break down in time and multiple chromosomes should not be simulated independently under the multiple merger models. Unlinked chromosomes should scatter into distinct ancestors instantaneously under multiple merger coalescents, not with probability 1/2 per generation as in discrete population models. In msprime, this waiting time of length zero is approximated by an exponential distribution with rate \(r\), i.e. the recombination rate at the base pair separating the chromosomes, which should be set to a high enough value to obtain an acceptable approximation. We recommend considering in particular the relative error in comparison to the magnitudes other waiting times which are likely to arise in the simulation.
Important
Simulations of multiple chromosomes under either DTWF or the multiple merger models should use a discrete genome. While it is possible to set up such a simulation using a continuous genome, we strongly recommend against it, as chromosome divisions cannot be defined as discrete break points and the simulation will be very inefficient.
Example 1: simulating with a pedigree#
In this example, we simulate genomic data for a pair of first cousins in a small pedigree extending 3 generations into the past. See here for details on specifying, importing, and simulating using pedigrees.
Let’s first specify our pedigree:
import io
import tskit
ped_txt = """\
# id parent0 parent1 time is_sample
0 2 3 0.0 1
1 4 5 0.0 1
2 6 7 1.0 0
3 8 9 1.0 0
4 6 7 1.0 0
5 10 11 1.0 0
6 . . 2.0 0
7 . . 2.0 0
8 . . 2.0 0
9 . . 2.0 0
10 . . 2.0 0
11 . . 2.0 0
"""
pedigree = msprime.parse_pedigree(io.StringIO(ped_txt), sequence_length=100)
draw_pedigree(pedigree.tree_sequence())
Then suppose we would like to simulate three chromosomes of differing lengths and recombination rates. In practice, we might have recombination maps for each chromosome, but the general strategy remains the same.
Ls = [1000000, 2000000, 3000000]
rs = [1e-8, 2e-8, 3e-8]
ts_chroms = []
pedigree = msprime.parse_pedigree(io.StringIO(ped_txt), sequence_length=1)
for i, (L, r) in enumerate(zip(Ls, rs)):
pedigree.sequence_length = L
ped_ts = msprime.sim_ancestry(
initial_state=pedigree, model="fixed_pedigree",
recombination_rate=r, random_seed=i+1)
ts_chroms.append(
msprime.sim_ancestry(
initial_state=ped_ts, population_size=1000,
recombination_rate=r, model="dtwf", random_seed=i+100))
for i, ts in enumerate(ts_chroms):
print(f"chromosome {i} has length {ts.sequence_length} and {ts.num_trees} trees")
chromosome 0 has length 1000000.0 and 48 trees
chromosome 1 has length 2000000.0 and 155 trees
chromosome 2 has length 3000000.0 and 377 trees
This results in three chromosomes of the desired lengths.
Example 2: simulating without a pedigree#
We first set up the chromosome coordinates, recombination rates, and the corresponding recombination map. The following defines a recombination map for three chromosomes each 1 cM in length:
import math
r_chrom = 1e-8
r_break = math.log(2)
chrom_positions = [0, 1e6, 2e6, 3e6]
map_positions = [
chrom_positions[0],
chrom_positions[1],
chrom_positions[1] + 1,
chrom_positions[2],
chrom_positions[2] + 1,
chrom_positions[3]
]
rates = [r_chrom, r_break, r_chrom, r_break, r_chrom]
rate_map = msprime.RateMap(position=map_positions, rate=rates)
We cannot use the default StandardCoalescent
model to run simulations
of multiple chromosomes; doing so would be equivalent to running independent
replicate simulations and much less efficient. To get the correct
correlation between chromosomes, we must use a model like
discrete-time Wright-Fisher.
However, because correlations
between chromosomes break down very rapidly, we only need to simulate using
dtwf
for roughly 10-20 generations, after which we can switch to the hudson
model to finish the simulation more efficiently (see the
Specifying ancestry models section for more information on switching
between ancestry models).
In this example, we simulate 10 sampled diploid individuals using the
recombination map defined above for three chromosomes and switching from the
dtwf
to hudson
model 20 generations ago:
ts = msprime.sim_ancestry(
10,
population_size=1000,
recombination_rate=rate_map,
model=[
msprime.DiscreteTimeWrightFisher(duration=20),
msprime.StandardCoalescent(),
],
random_seed=1234,
)
ts
Tree Sequence | |
---|---|
Trees | 351 |
Sequence Length | 3000000.0 |
Time Units | generations |
Sample Nodes | 20 |
Total Size | 60.3 KiB |
Metadata | No Metadata |
Table | Rows | Size | Has Metadata |
---|---|---|---|
Edges | 1254 | 39.2 KiB | |
Individuals | 10 | 304 Bytes | |
Migrations | 0 | 8 Bytes | |
Mutations | 0 | 16 Bytes | |
Nodes | 342 | 9.4 KiB | |
Populations | 1 | 224 Bytes | ✅ |
Provenances | 1 | 1.4 KiB | |
Sites | 0 | 16 Bytes |
To place each chromosome in its own separate tree sequence, we can trim the full tree
sequence for each chromosome by defining the end points of each chromosome and
using the keep_intervals()
method. It is important to
specify simplify=False
so that node indices remain consistent across
chromosomes. The trim()
method is then used to trim
the tree sequences to the focal chromosome, so that position indices begin at
0 within each chromosome tree sequence.
ts_chroms = []
for j in range(len(chrom_positions) - 1):
start, end = chrom_positions[j: j + 2]
chrom_ts = ts.keep_intervals([[start, end]], simplify=False).trim()
ts_chroms.append(chrom_ts)
print(chrom_ts.sequence_length)
# the third chromosome
chrom_ts
1000000.0
1000000.0
1000000.0
Tree Sequence | |
---|---|
Trees | 102 |
Sequence Length | 1000000.0 |
Time Units | generations |
Sample Nodes | 20 |
Total Size | 26.7 KiB |
Metadata | No Metadata |
Table | Rows | Size | Has Metadata |
---|---|---|---|
Edges | 371 | 11.6 KiB | |
Individuals | 10 | 304 Bytes | |
Migrations | 0 | 8 Bytes | |
Mutations | 0 | 16 Bytes | |
Nodes | 342 | 9.4 KiB | |
Populations | 1 | 224 Bytes | ✅ |
Provenances | 3 | 2.3 KiB | |
Sites | 0 | 16 Bytes |
This gives us a list of tree sequences, one for each chromosome, in the order that they were stitched together in the initial recombination map.
Note
Many recombination maps inferred from data will contain regions at the start and end of the chromosome where recombination rates could not be inferred. These regions typically are assigned rate values of NaN. If we were to concatenate multiple recombination maps that have flanking NaN regions, the resulting maps would have regions with NaN rates in the interior of the map, which will produce an error when using such a map in simulations.
These regions typically represent unassay-able segments of the chromosome, and
simulated genetic material in these regions should be discarded in downstream
analyses. Thus, the suggested approach is to modify the recombination maps to
set rates to zero in these flanking regions, concatenate the maps as shown in
this example, and then remove those regions of the resulting tree sequences by
calling ts.delete_intervals(nan_intervals)
, where nan_intervals
is a list
containing [start, end]
values for each NaN flanking region.
Recording more information#
By default msprime
stores the minimum amount of information required
to represent the simulated genealogical history of the samples. Sometimes
we are interested in more detailed information; this section gives details
of options that allow us to do this.
Additional nodes#
In msprime
we usually want to simulate the coalescent with recombination
and represent the output as efficiently as possible. As a result, we don’t
store individual recombination events, but rather their effects on the output
tree sequence. We also do not explicitly store common ancestor events that
do not result in marginal coalescences. For some purposes, however,
we want record information on other events of interest, not just the mimimal
representation of its outcome.
The additional_nodes
and
coalescing_segments_only options serve
this exact purpose. These options allow us to record the nodes associated with
a custom subset of all events we might observe in the history of a sample.
Besides samples and coalescence events, nodes can now also represent
common ancestor events,
recombination,
gene conversion,
migration,
and pass through events. The example below
and the next few paragraphs provide a guide on how
to record additional nodes and interpret the resulting node tables.
We first set up a simple pedigree simulation. Pedigrees are an interesting starting point as in contrast to the coalescent models a single node might be associated with more than one type of event. Conversely, for the continuous-time coalescent model simulated by default by msprime, each event is registered as a new node.
pb = msprime.PedigreeBuilder()
mp_21 = pb.add_individual(time=2)
dp_21 = pb.add_individual(time=2)
mp_11 = pb.add_individual(time=1, parents=[mp_21, dp_21])
dp_11 = pb.add_individual(time=1, parents=[mp_21, dp_21])
mp_12 = pb.add_individual(time=1, parents=[mp_21, dp_21])
dp_12 = pb.add_individual(time=1, parents=[mp_21, dp_21])
pb.add_individual(time=0, parents=[mp_11, dp_12], is_sample=True)
pb.add_individual(time=0, parents=[mp_11, dp_12], is_sample=True)
pb.add_individual(time=0, parents=[mp_12, dp_11], is_sample=True)
pedigree_tables = pb.finalise(sequence_length=5)
draw_pedigree(pedigree_tables.tree_sequence())
To specify the particular node types we want to store information on, pass a
NodeType
object to the additional_nodes
option of
sim_ancestry()
. This class extends enum.Flag
.
Each node type in the enumeration is associated with a numerical constant.
Different node types can be combined and compared using
bitwise operations.
At the same time, these constant also function as the flags identifying
the type of each node in the nodes table. Here, we take the bitwise union
of three members of the enumeration:
msprime.NodeType.RECOMBINANT
, msprime.NodeType.PASS_THROUGH
and msprime.NodeType.COMMON_ANCESTOR
.
ts = msprime.sim_ancestry(
initial_state=pedigree_tables,
model='fixed_pedigree',
recombination_rate=0.1,
random_seed=45,
additional_nodes=(
msprime.NodeType.RECOMBINANT |
msprime.NodeType.PASS_THROUGH |
msprime.NodeType.COMMON_ANCESTOR
),
coalescing_segments_only=False
)
def count_flags(ts):
counter = dict()
for nodetype in msprime.NodeType:
flags = np.bitwise_and(ts.nodes_flags, nodetype.value)
counter[nodetype.name] = np.sum(flags > 0)
return counter
print(count_flags(ts))
{'RECOMBINANT': np.int64(4), 'COMMON_ANCESTOR': np.int64(0), 'MIGRANT': np.int64(0), 'CENSUS': np.int64(0), 'GENE_CONVERSION': np.int64(0), 'PASS_THROUGH': np.int64(7)}
The count_flags
function demonstrates the use of
bitwise operations to determine the type of a node. This function iterates
across all node types and checks for all rows in the nodes table whether
the intersection of the basic node type and the node within the nodes table
is different from 0. It returns a dict
containing the counts of all possible
node types.
Note that recombination events are
associated with two nodes,
one for both ends that are created by recombination backwards in time. The
number of recombination events is thus half the number of recombination nodes.
PASS_THROUGH
events occur when the ancestral material of only a single
lineage passes through the genome of an ancestor, making coalescence
impossible. This event can only be observed in models that track individuals:
msprime.DiscreteTimeWrightFisher
and
msprime.FixedPedigree
models.
node_style_map = {
msprime.NodeType.RECOMBINANT.value: 'yellow',
msprime.NodeType.COMMON_ANCESTOR.value: 'red',
msprime.NodeType.PASS_THROUGH.value: 'blue',
(msprime.NodeType.RECOMBINANT | msprime.NodeType.PASS_THROUGH).value : 'green',
(msprime.NodeType.RECOMBINANT | msprime.NodeType.COMMON_ANCESTOR).value : 'orange'
}
css_string = ""
for node in ts.nodes():
if node.flags in node_style_map:
css_string += ".n%s > .sym {fill: %s}" % (node.id, node_style_map[node.flags])
wide_fmt = (800, 250)
ts.draw_svg(style=css_string, size=wide_fmt)
Similarly, the bitwise operations logic can be used to color the different nodes
in the tree sequence according to their NodeType
. Node 5 is both
msprime.NodeType.RECOMBINANT
and
msprime.NodeType.PASS_THROUGH
(green). Nodes 0, 1, 4 are
recombinant nodes (yellow). All other nodes are only associated with a
PASS_THROUGH
event (blue). Note that we do not observe any COMMON_ANCESTOR
events. This means that all yellow nodes are also associated with a
coalesence event (see common ancestor events).
Note that in order to verify whether a node is RECOMBINANT
and PASS_THROUGH
,
we use the bitwise OR to define its associated NodeType
constant.
To summarize: additional_nodes
are specified using bitwise flags. Multiple
basic node types can be combined by taking the bitwise OR
(|) and then
passed to the additional_nodes
option. This enables us to track the specified
nodes across the genealogical history of the sample. By means of bitwise AND
(&) we can then query the nodes table and check the type of each of the recorded
nodes.
If we wish to reduce these trees down to the minimal representation, we can use
tskit.TreeSequence.simplify()
. The resulting tree sequence will have
all of these unary nodes removed and will be equivalent to (but not identical,
due to stochastic effects) calling sim_ancestry()
without the
additional_nodes
or coalescing_segment_only
argument(s).
Coalescing segments only#
Note that the coalescing_segments_only
option should be set to False
when
recording additional nodes. Setting coalescing_segments_only
to False
allows us to switch off the default simulator behaviour of only recording the
relationships between overlapping ancestry segments. Instead, you can now
record edges along the full extent of any of the ancestral lineages
that was involved in any of the events as specified by the additional_nodes
flag.
This option can also be set to False
without specifying any additional nodes.
When coalescing_segments_only
is set to False
,
edges that are a result of a coalescent event will record
the full length of the tracked ancestral material,
not just the overlapping segments of genome (as is the default behavior.
As a result, this option will produce
in nodes with only one child (unary nodes) along parts (unary regions) of
the genome.
For instance: suppose an ancestral segment ancestral to node m
spans from a
to b
along the genome,
and another overlapping segment ancestral to node n
spans from c
to d
, with a < c < b < d
.
If the two coalesce to create node p
, if coalescing_segments_only=True
(the default)
then edges from m
and n
to p
over the genomic segment [c,b)
would be created,
and the ancestry of the remaining segments (i.e., [a,c)
for m
and [b,d)
for n
) would be left unspecified
until further coalescent events.
However, if coalescing_segments_only=False
, then the edges recorded
would be from m
to p
over the entire segment [a, b)
, and from n
to p
over the entire segment [c,d)
.
The nodes m
and n
coalesce (in p
) on only the overlapping segment [c,b)
, and so the node p
will be a unary node in the flanking regions: above m
on the segment [a,c)
and above n
on the segment [b,d)
.
Common ancestor events#
We distinguish two types of common ancestor events: coalescence and common ancestor (in the strict sense) events.
Coalescence events are the default node type (associated with value 0) and are
therefore only implicitly part of the NodeType
enumeration.
In contrast, whenever two nonoverlapping ancestral segments
are found to have inherited from the same ancestral genome,
there is no coalescence event, and so we register this in the node table as
msprime.NodeType.COMMON_ANCESTOR
.
Finally, when keeping track of individuals
(msprime.DiscreteTimeWrightFisher
,
msprime.FixedPedigree
), we might observe genomes (ploids) through
which the ancestral material of only a single lineage passes.
Such genomes can be registered in the table as
msprime.NodeType.PASS_THROUGH
nodes.
Although this is obviously not a common ancestor event, note that for both
of these models each node has to carry at least one of these three flags.
Recombination events#
Additional nodes can also be used to mark recombination events (cross over) as well
as gene conversion. A seperate msprime.NodeType
exists for each:
msprime.NodeType.RECOMBINANT
and msprime.NodeType.GENE_CONVERSION
.
A recombination event results in two extra nodes being recorded, one identifying the genome providing the genetic material to the left of the breakpoint and the other identifying the genome providing the genetic material to the right. For more information on how to set the recombination rate or specify a recombination map: see Recombination. More information on how to set up a simulation with gene conversion can be found here: Gene conversion.
Migration events#
When running simulations with demography, by
default, migration events are not recorded in the resulting tree sequence.
While the population associated with each node is determined by the simulated migration
events, knowledge of the
precise sequence of migrations is useful in a number of situations: for example see
the tskit
tutorial on introgression.
The record_migrations
parameter allows us to record these events in the
simulated tree sequence outputted by sim_ancestry()
. Briefly, a migration record
encodes the node associated with the migration, the time of the event, the source and
destination populations (backwards in time), and the left and right coordinates of the
migrating segement. For more details on migration records, see the
migration table definition.
Here, we provide a simple example of the effect of setting record_migrations=True
using the stepping stone model
where migration is permitted between adjacent populations. Additionally, we
set additional_nodes=msprime.NodeType.MIGRANT
(see previous section)
to record nodes corresponding to migration events. This is not necessary, but will be
helpful to visualise the result.
N = 10
demography = msprime.Demography.stepping_stone_model(
[20] * N, migration_rate=0.1, boundaries=True
)
ts = msprime.sim_ancestry(
{0: 1, N - 1: 1},
demography=demography,
record_migrations=True,
additional_nodes=msprime.NodeType.MIGRANT,
coalescing_segments_only=False,
random_seed=6,
)
In the following visualisation, we trace the migration history of two samples backwards in time to their most recent common ancestor. Each one of the ten populations in the model is assigned a colour and a position on the x axis. The time of migration events is recorded on the y axis.
import tskit
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
cmap = plt.colormaps['viridis']
tree = ts.first()
mig_color_dict = {}
for u, linestyle in zip(ts.samples()[::2], ["dotted", "dashdot"]):
mig = ts.tables.migrations
loc = []
time = []
while u != tskit.NULL:
migs = np.where(mig.node == u)[0]
for cur_mig in migs:
cur_mig = mig[cur_mig]
loc.append(cur_mig.dest + 0.5)
time.append(cur_mig.time)
u = tree.parent(u)
plt.plot(loc, time, linestyle=linestyle, color="black")
for i in range(N):
rgba = cmap(i / N)
mig_color_dict[i] = np.array(rgba)*255
plt.axvspan(i, i+1, facecolor=rgba, alpha=0.5)
plt.ylabel("Time of migration event")
plt.xlabel("Population")
plt.show()
Note also how many migration events occurred before the two samples coalesced, even in this small example.
The next visualisation shows where each of these migrations fall on the edges of the simulated tree. We show the nodes associated with each migration event, colouring nodes and edges by the population with which they are associated (using the same colour scheme as the previous visualisation).
mig_color_dict = {}
cmap = plt.colormaps['viridis']
for i in range(N):
rgba = cmap(i / N)
mig_color_dict[i] = np.array(rgba) * 255
edge_colors = {}
string = ".tree .edge {stroke-width:4px}"
pop = ts.tables.nodes.population
for node in ts.nodes():
node = node.id
edge_colors[node] = mig_color_dict[pop[node]]
string += (
".node.n"
+ str(node)
+ "> .sym {fill: rgb"
+ str(tuple(mig_color_dict[pop[node]][0:3]))
+ "}"
)
string += (
".node.n"
+ str(node)
+ " > .edge {stroke: rgb"
+ str(tuple(mig_color_dict[pop[node]][0:3]))
+ "}"
)
for mig in ts.migrations():
edge_colors[mig.node] = mig_color_dict[mig.source]
string += (
".node.n"
+ str(mig.node)
+ "> .sym {fill: rgb"
+ str(tuple(mig_color_dict[pop[mig.node]][0:3]))
+ "}"
)
string += (
".node.n"
+ str(mig.node)
+ " > .edge {stroke: rgb"
+ str(tuple(mig_color_dict[mig.dest][0:3]))
+ "}"
)
node_labels = {node.id: "" for node in ts.nodes()}
SVG(
ts.first().draw_svg(
size=(500, 400), style=string, node_labels=node_labels, symbol_size=10
)
)
Census events#
Census events allow you to add a node to each branch of the tree sequence at a given time during the simulation. This can be useful when you wish to study haplotypes that are ancestral to your simulated sample, or when you wish to know which lineages were present in which populations at specified times.
For instance, the following example creates a two population island
model Demography
, with the populations exchanging migrants
at rate 0.05. At generation 5000, we add a census event using the
Demography.add_census()
method
to determine where each of the lineages is at that time:
demography = msprime.Demography.island_model([1000, 1000], migration_rate=0.05)
demography.add_census(time=5000)
ts = msprime.sim_ancestry(
{0: 1, 1: 1},
demography=demography,
recombination_rate=1e-7,
sequence_length=1000,
random_seed=141
)
css_string = ""
for node in ts.nodes():
if node.time == 5000:
css_string += ".n%s > .sym {fill: %s}" % (node.id, 'green')
SVG(ts.draw_svg(style=css_string))
The resulting tree sequence has nodes on each tree at the specified census time. These are the nodes with IDs 8, 9, 10, 11, 12 and 13.
This tells us that the genetic material ancestral to the present day sample was held within 5 haplotypes at time 5000. The node table shows us that four of these haplotypes (nodes 8, 9, 10 and 11) were in population 0 at this time, and two of these haplotypes (nodes 12 and 13) were in population 1 at this time.
print(ts.tables.nodes)
census_nodes = np.bitwise_and(ts.nodes_flags, msprime.NodeType.CENSUS.value) > 0
print(ts.tables.nodes[census_nodes])
╔══╤═══════╤══════════╤══════════╤═════════════╤════════╗
║id│flags │population│individual│time │metadata║
╠══╪═══════╪══════════╪══════════╪═════════════╪════════╣
║0 │ 1│ 0│ 0│ 0.00000000│ ║
║1 │ 1│ 0│ 0│ 0.00000000│ ║
║2 │ 1│ 1│ 1│ 0.00000000│ ║
║3 │ 1│ 1│ 1│ 0.00000000│ ║
║4 │ 0│ 1│ -1│2331.91109774│ ║
║5 │ 0│ 1│ -1│3741.02811877│ ║
║6 │ 0│ 0│ -1│4216.80416680│ ║
║7 │ 0│ 1│ -1│4580.66488183│ ║
║8 │1048576│ 0│ -1│5000.00000000│ ║
║9 │1048576│ 0│ -1│5000.00000000│ ║
║10│1048576│ 0│ -1│5000.00000000│ ║
║11│1048576│ 0│ -1│5000.00000000│ ║
║12│1048576│ 1│ -1│5000.00000000│ ║
║13│1048576│ 1│ -1│5000.00000000│ ║
║14│ 0│ 1│ -1│5232.65267392│ ║
║15│ 0│ 0│ -1│8192.48105714│ ║
╚══╧═══════╧══════════╧══════════╧═════════════╧════════╝
╔══╤═══════╤══════════╤══════════╤════╤════════╗
║id│flags │population│individual│time│metadata║
╠══╪═══════╪══════════╪══════════╪════╪════════╣
║0 │1048576│ 0│ -1│5000│ ║
║1 │1048576│ 0│ -1│5000│ ║
║2 │1048576│ 0│ -1│5000│ ║
║3 │1048576│ 0│ -1│5000│ ║
║4 │1048576│ 1│ -1│5000│ ║
║5 │1048576│ 1│ -1│5000│ ║
╚══╧═══════╧══════════╧══════════╧════╧════════╝
If we wish to study these ancestral haplotypes further, we can simplify the tree sequence
with respect to the census nodes and perform subsequent analyses on this simplified tree
sequence.
In this example, ts_anc
is a tree sequence obtained from the original tree sequence
ts
by labelling the census nodes as samples and removing all nodes and edges that are
not ancestral to these census nodes.
nodes = np.where(census_nodes)[0]
print(nodes)
ts_anc = ts.simplify(samples=nodes)
[ 8 9 10 11 12 13]
Using the DTWF model#
Census events assume that they are being used as part of a continuous time
simulation process, in which multiple events are guaranteed not to happen
at the same time. However, the DiscreteTimeWrightFisher
model does
not meet these assumptions, and allows multiple events (such as coalescences,
migrations etc) to all occur at the same time. This includes census events,
which can lead to errors when used in conjunction with the DTWF model.
Consider the following example:
demography = msprime.Demography()
demography.add_population(name="A", initial_size=10)
ts = msprime.sim_ancestry({"A": 3}, demography=demography, model="dtwf", random_seed=33)
SVG(ts.draw_svg(y_axis=True, time_scale="log_time"))
We can see that a coalescence happens at time 3, giving node 7. Now, let’s try to perform a census at time 3:
demography = msprime.Demography()
demography.add_population(name="A", initial_size=10)
demography.add_census(time=3)
ts = msprime.sim_ancestry({"A": 3}, demography=demography, model="dtwf", random_seed=33)
---------------------------------------------------------------------------
LibraryError Traceback (most recent call last)
Cell In[47], line 4
2 demography.add_population(name="A", initial_size=10)
3 demography.add_census(time=3)
----> 4 ts = msprime.sim_ancestry({"A": 3}, demography=demography, model="dtwf", random_seed=33)
File ~/work/tskit-site/tskit-site/msprime/ancestry.py:1311, in sim_ancestry(samples, demography, sequence_length, discrete_genome, recombination_rate, gene_conversion_rate, gene_conversion_tract_length, population_size, ploidy, model, initial_state, start_time, end_time, record_migrations, record_full_arg, additional_nodes, coalescing_segments_only, num_labels, random_seed, num_replicates, replicate_index, record_provenance)
1289 provenance_dict = provenance.get_provenance_dict(parameters)
1290 sim = _parse_sim_ancestry(
1291 samples=samples,
1292 sequence_length=sequence_length,
(...)
1309 random_seed=random_seed,
1310 )
-> 1311 return _wrap_replicates(
1312 sim,
1313 num_replicates=num_replicates,
1314 replicate_index=replicate_index,
1315 provenance_dict=provenance_dict,
1316 )
File ~/work/tskit-site/tskit-site/msprime/ancestry.py:681, in _wrap_replicates(simulator, num_replicates, replicate_index, provenance_dict, mutation_rate)
675 iterator = simulator.run_replicates(
676 num_replicates,
677 mutation_rate=mutation_rate,
678 provenance_dict=provenance_dict,
679 )
680 if replicate_index is not None:
--> 681 deque = collections.deque(iterator, maxlen=1)
682 return deque.pop()
683 else:
File ~/work/tskit-site/tskit-site/msprime/ancestry.py:1606, in Simulator.run_replicates(self, num_replicates, mutation_rate, provenance_dict)
1604 for replicate_index in range(num_replicates):
1605 logger.info("Starting replicate %d", replicate_index)
-> 1606 self.run()
1607 if mutation_rate is not None:
1608 # This is only called from simulate() or the ms interface,
1609 # so does not need any further parameters.
1610 mutations._simple_mutate(
1611 tables=self.tables,
1612 random_generator=self.random_generator,
1613 sequence_length=self.sequence_length,
1614 rate=mutation_rate,
1615 )
File ~/work/tskit-site/tskit-site/msprime/ancestry.py:1571, in Simulator.run(self, event_chunk, debug_func)
1569 raise ValueError("Model durations must be >= 0")
1570 end_time = min(self.time + model_duration, self.end_time)
-> 1571 exit_reason = self._run_until(end_time, event_chunk, debug_func)
1572 if exit_reason == ExitReason.COALESCENCE or self.time == self.end_time:
1573 logger.debug("Skipping remaining %d models", len(self.models) - j - 1)
File ~/work/tskit-site/tskit-site/msprime/ancestry.py:1536, in Simulator._run_until(self, end_time, event_chunk, debug_func)
1534 ret = ExitReason.MAX_EVENTS
1535 while ret == ExitReason.MAX_EVENTS:
-> 1536 ret = ExitReason(super().run(end_time, event_chunk))
1537 if self.time > end_time:
1538 # Currently the Pedigree and Sweeps models are "non-reentrant"
1539 # We can change this to an assertion once these have been fixed.
1540 raise RuntimeError(
1541 f"Model {self.model['name']} does not support interruption. "
1542 "Please open an issue on GitHub"
1543 )
LibraryError: The simulation model supplied resulted in a parent node having a time value <= to its child. This can occur as a result of multiple bottlenecks happening at the same time, multiple census events at the same time or numerical imprecision with very smallpopulation sizes.
We got an error from msprime about a parent node having a time value <= to its child, which occurred because we tried to add set of census nodes at time 3, but node 7 was already at time 3, giving a zero branch length (which is disallowed by tskit).
The solution is to use non-integer time values when performing a census under the DTWF:
demography = msprime.Demography()
demography.add_population(name="A", initial_size=10)
demography.add_census(time=3.5)
ts = msprime.sim_ancestry({"A": 3}, demography=demography, model="dtwf", random_seed=33)
SVG(ts.draw_svg(y_axis=True, time_scale="log_time"))
Ancestral recombination graph#
This is a legacy option and identical to setting the additional_nodes
option to store
common_ancestor events, recombinants, (and migrants if applicable). This is illustrated
in the following example:
ts = msprime.sim_ancestry(
3, recombination_rate=0.1, sequence_length=2,
record_full_arg=True, random_seed=42)
additional_nodes = msprime.NodeType.COMMON_ANCESTOR | msprime.NodeType.RECOMBINANT
ts_other = msprime.sim_ancestry(
3, recombination_rate=0.1, sequence_length=2,
additional_nodes=additional_nodes,
coalescing_segments_only=False,
random_seed=42)
print(ts.tables.equals(ts_other.tables, ignore_provenance=True))
print(ts.tables.nodes)
SVG(ts.draw_svg())
True
╔══╤══════╤══════════╤══════════╤══════════╤════════╗
║id│flags │population│individual│time │metadata║
╠══╪══════╪══════════╪══════════╪══════════╪════════╣
║0 │ 1│ 0│ 0│0.00000000│ ║
║1 │ 1│ 0│ 0│0.00000000│ ║
║2 │ 1│ 0│ 1│0.00000000│ ║
║3 │ 1│ 0│ 1│0.00000000│ ║
║4 │ 1│ 0│ 2│0.00000000│ ║
║5 │ 1│ 0│ 2│0.00000000│ ║
║6 │ 0│ 0│ -1│0.21230674│ ║
║7 │ 0│ 0│ -1│0.51485157│ ║
║8 │ 0│ 0│ -1│0.71161449│ ║
║9 │131072│ 0│ -1│0.91107705│ ║
║10│131072│ 0│ -1│0.91107705│ ║
║11│262144│ 0│ -1│1.21743776│ ║
║12│ 0│ 0│ -1│1.23130397│ ║
║13│ 0│ 0│ -1│4.80416300│ ║
╚══╧══════╧══════════╧══════════╧══════════╧════════╝
After running the simulation we first print out the
node table, which
contains information on all the nodes in the tree sequence. Note that flags
column contains several different values: all of the sample nodes (at time 0)
have a flag value of 1
(tskit.NODE_IS_SAMPLE
). Most other
nodes have a flag value of 0
, which is the standard for internal nodes
in a coalescent simulations.
Nodes 9 and 10 have flags equal to 131072 (NodeType.RECOMBINANT
), which
tells us that they correspond to a recombination event in the ARG. A
recombination event results in two extra nodes being recorded, one identifying
the individual providing the genetic material to the left of the breakpoint and
the other identifying the individuals providing the genetic material to the
right. The effect of this extra node can be seen in the trees: node 9 is
present as a ‘unary’ node in the left hand tree and node 10 in the right.
In this particular case, the first thing that happens to these two lineages
as we go up the tree sequence is that they coalesce straight back together again
at node 11, forming a (normally undetectable) “diamond event” in the ARG, and
explaining why the topology of both these trees appears the same. In a full ARG,
this is one of many ways that a coalescent event can occur without resulting in a
marginal coalescence, and which instead results in an additional unary node in the
trees. Such nodes are given a flags value of 262144 (NodeType.COMMON_ANCESTOR
).
If we wish to reduce these trees down to the minimal representation, we can
use tskit.TreeSequence.simplify()
. The resulting tree sequence will have
all of these unary nodes removed and will be equivalent to (but not identical, due to
stochastic effects) calling sim_ancestry()
without the record_full_arg
argument.
Migrations nodes are also recorded in the ARG using the
NodeType.MIGRANT
flag. See the Node flags
section for more details.
Manipulating simulation time#
Stopping simulations early#
In most simulations we want to simulate a complete history for our samples: that is, to when we have a common ancestor at every point along the sequence. For example, here we generate a complete ancestral history for two diploid samples:
ts = msprime.sim_ancestry(2, sequence_length=10, recombination_rate=0.1, random_seed=42)
SVG(ts.draw_svg(y_axis=True, y_ticks=[0, 1, 2, 3, 4, 5]))
Sometimes we would like to stop the simulation early, before complete coalescence has occurred; perhaps we would like to combine several simulations with different properties, or we are really only interested in the recent past and simulating more ancient processes would be a waste of computational resources.
We can do this by specifying a end_time
value.
Let’s repeat the simulation as above with an end_time
of 2:
ts = msprime.sim_ancestry(
2, sequence_length=10, recombination_rate=0.1, end_time=2, random_seed=42)
SVG(ts.draw_svg(y_axis=True, y_ticks=[0, 1, 2]))
There are a number of important things to observe about this tree sequence:
The history up until time 2 is identical to the original simulation.
The first tree has not fully coalesced, and we therefore have multiple roots. Trees with multiple roots are fully supported in tskit.
The nodes 8 and 9 are at time 2 (our
end_time
value), and these connect to the uncoalesced portions of the tree. These “unary” nodes are very important for ensuring the correct statistical properties when we combine multiple simulations.
See also
The ancestry model duration can also be used to stop simulations before coalescence.
For discrete time models, the exact interpretation
of the end_time
is important — we continue to simulate
the process while the time is <= end_time
. For example here
we use an end_time
of 1 generation in a Discrete Time Wright-Fisher
simulation:
ts = msprime.sim_ancestry(3, population_size=10, model="dtwf", end_time=1, random_seed=42)
SVG(ts.draw_svg(y_axis=True))
We have run 1 generation of the DTWF, resulting in the coalescence at node 9.
Setting the start time#
Important
Setting simulation start time is an advanced feature which is rarely needed in practice.
Sometimes we need control when the simulation process starts. For example, given the initial configuration of at time zero we want to “jump ahead” to a later time point before we start generating random events. Consider the following example:
ts = msprime.sim_ancestry(2, random_seed=42)
ts.tables.nodes
id | flags | population | individual | time | metadata |
---|---|---|---|---|---|
0 | 1 | 0 | 0 | 0.00000000 | |
1 | 1 | 0 | 0 | 0.00000000 | |
2 | 1 | 0 | 1 | 0.00000000 | |
3 | 1 | 0 | 1 | 0.00000000 | |
4 | 0 | 0 | -1 | 0.15642269 | |
5 | 0 | 0 | -1 | 0.29152170 | |
6 | 0 | 0 | -1 | 2.11740681 |
Now we run the same simulation with a start_time
of 10:
ts = msprime.sim_ancestry(2, start_time=10, random_seed=42)
ts.tables.nodes
id | flags | population | individual | time | metadata |
---|---|---|---|---|---|
0 | 1 | 0 | 0 | 0.00000000 | |
1 | 1 | 0 | 0 | 0.00000000 | |
2 | 1 | 0 | 1 | 0.00000000 | |
3 | 1 | 0 | 1 | 0.00000000 | |
4 | 0 | 0 | -1 | 10.15642269 | |
5 | 0 | 0 | -1 | 10.29152170 | |
6 | 0 | 0 | -1 | 12.11740681 |
We can see that the results are identical except that 10 has been added to all of the internal node times.
Important
It is important to note that the start_time
only controls the
“random” events that happen within the simulation model: events
that are specified to occur at a given time (such as sampling
or demographic events) will occur at that time, even if this
is before the start_time
.
Consider the following example in which we create a simple
population tree
demography, and then take samples at different times all before
start_time
:
demography = msprime.Demography()
demography.add_population(name="A", initial_size=100)
demography.add_population(name="B", initial_size=100)
demography.add_population(name="C", initial_size=100)
demography.add_population_split(time=5, derived=["A", "B"], ancestral="C")
ts = msprime.sim_ancestry(
samples=[
msprime.SampleSet(1, population="A", time=1),
msprime.SampleSet(1, population="B", time=2),
msprime.SampleSet(1, population="C", time=6),
],
demography=demography,
ploidy=1,
start_time=10,
random_seed=1)
ts.tables.nodes
id | flags | population | individual | time | metadata |
---|---|---|---|---|---|
0 | 1 | 0 | 0 | 1.00000000 | |
1 | 1 | 1 | 1 | 2.00000000 | |
2 | 1 | 2 | 2 | 6.00000000 | |
3 | 0 | 2 | -1 | 52.47084127 | |
4 | 0 | 2 | -1 | 66.18169933 |
We can see that the samples we specify are in the correct populations and their node times are at the sampling time. Note that all the coalescences happen in population ID 2 (“C”), indicating that the lineages from “A” and “B” moved there at the time of the split (which we can observe directly using the record_migrations option if we wish). Note also that we drew a sample from population “C” at time 6, which is after the split; if we tried to draw a sample before the split, we would get an error because the population would be inactive.
See also
See the Order of event execution section for details of the precise order in which fixed-time events occur.
Note
The same considerations apply when we are running simulations based
on a given initial state; in
this case, the start_time
is by default set to the time
of the youngest root in the input trees.
Order of event execution#
There are two different classes of event that happen in ancestry simulations: “fixed” and “random” events. Random events are generated by the ancestry model, and fixed events are those that occur at a time specified by the user. Fixed events are demographic events or sampling events. If multiple events are specified to occur at exactly the same time, then:
All demographic events at this time are run, in the order they were specified.
Sampling events occur immediately after all the demographic events have been executed. (See the Sampling time section for more details on ancient sampling events.)
These details are slightly different for the Discrete Time Wright-Fisher
model. See the API documentation
for
details.
Specifying the initial state#
By default msprime
simulations are initialised by specifying a set of
samples,
which sets up the simulation with segments of ancestral material covering the
whole sequence. Internally, this initial state is implemented by
setting up a tskit.TableCollection
which contains the
population,
node,
and individual information.
The low-level simulation code then reads these tables and sets up the
simulation accordingly. We can see what this initial state looks like
by setting up a simple simulation and stopping it immediately:
initial_ts = msprime.sim_ancestry(1, end_time=0)
initial_ts
Tree Sequence | |
---|---|
Trees | 1 |
Sequence Length | 1.0 |
Time Units | generations |
Sample Nodes | 2 |
Total Size | 1.4 KiB |
Metadata | No Metadata |
Table | Rows | Size | Has Metadata |
---|---|---|---|
Edges | 0 | 8 Bytes | |
Individuals | 1 | 52 Bytes | |
Migrations | 0 | 8 Bytes | |
Mutations | 0 | 16 Bytes | |
Nodes | 2 | 64 Bytes | |
Populations | 1 | 224 Bytes | ✅ |
Provenances | 1 | 1008 Bytes | |
Sites | 0 | 16 Bytes |
We can see that the returned tree sequence has one population, two sample nodes and one individual (since sample individuals are diploid by default). There are no edges because the simulation didn’t run for any period of time: we’re essentially just returning the initial state used to start the simulation.
Then, running a simulation in which the initial_state
is equal to
initial_ts
is precisely the same as running the simulation
with the same arguments as we used above:
ts1 = msprime.sim_ancestry(initial_state=initial_ts, population_size=100, random_seed=2)
ts2 = msprime.sim_ancestry(1, population_size=100, random_seed=2)
assert ts1.equals(ts2, ignore_provenance=True)
Warning
Note that the initial_state
tree sequence only contains the
information about the ancestral histories, and not any of the
simulation parameters. In the previous example we used default
parameters to keep things simple, but in general the full
simulation model will need to be specified as well as the initial state.
In particular, see the Interaction with demography section.
While it is possible to specify the initial state manually by
constructing the input tree sequence,
it is an advanced topic and rarely necessary. More often we are
interested in specifying the initial state of an msprime
simulation using the output of another simulation.
Note
The initial_state
tree sequence is quite literally the initial
state of the tree sequence that is returned at the end of a simulation,
and that any information present (including metadata) will not be
altered.
Combining backward-time simulations#
Suppose that we wished to simulate a scenario
in which we have recombination happening at some rate in the recent past,
and in the more distant past there is no recombination.
We can do this by combining the simulations,
using one as the initial_state
for the other.
We first run the simulation of the recent past:
ts1 = msprime.sim_ancestry(
2, recombination_rate=0.01, sequence_length=10, end_time=10,
population_size=100, random_seed=1)
SVG(ts1.draw_svg(y_axis=True, time_scale="log_time", y_ticks=[0, 1, 2, 5, 10]))
Some recombination and coalescence has happened and we have three marginal trees along the genome. None of these trees has fully coalesced, and so each of the trees has multiple roots.
Important
The set of nodes at time 10 connecting to all the internal nodes in the trees are very important: without these nodes and unary edges, the statistical properties of the combined simulations would not be correct! (See also section below on forward simulations.)
We then run the next phase of the simulation in which there is
no recombination by passing ts1
as the argument to initial_state
:
ts2 = msprime.sim_ancestry(initial_state=ts1, population_size=100, random_seed=3)
SVG(ts2.draw_svg(y_axis=True, time_scale="log_time", y_ticks=[0, 10, 20, 50, 100, 200, 500]))
Since there is no recombination in this more ancient simulation,
node 12
is the MRCA in all three trees. The unary nodes
6
, 7
, 8
, and 9
marking the transition between the
two simulation regimes are still present in the trees. If you
wish to remove these, we can use the tskit.TreeSequence.simplify()
method:
ts_simplified = ts2.simplify()
SVG(ts_simplified.draw_svg())
Interaction with demography#
In the previous example we saw how to combine two single population simulations. We can also combine more complex simulations involving demography, but care needs to be taken to ensure that the correct models are simulated in each part of the simulation. Suppose we have simple model with a population split:
demography = msprime.Demography()
demography.add_population(name="A", description="Contemporary population A", initial_size=100)
demography.add_population(name="B", description="Contemporary population B", initial_size=200)
demography.add_population(name="C", description="Ancestral population", initial_size=300)
demography.add_population_split(time=100, ancestral="C", derived=["A", "B"])
demography
id | name | description | initial_size | growth_rate | default_sampling_time | extra_metadata |
---|---|---|---|---|---|---|
0 | A | Contemporary population A | 100.0 | 0 | 0 | {} |
1 | B | Contemporary population B | 200.0 | 0 | 0 | {} |
2 | C | Ancestral population | 300.0 | 0 | 1e+02 | {} |
time | type | parameters | effect |
---|---|---|---|
100 | Population Split | derived=[A, B], ancestral=C | Moves all lineages from derived populations 'A' and 'B' to the ancestral 'C' population. Also set the derived populations to inactive, and all migration rates to and from the derived populations to zero. |
Then, we run a simulation of this model for 50 generations:
ts1 = msprime.sim_ancestry(
samples={"A": 1, "B": 1}, demography=demography, end_time=50, random_seed=1234)
node_labels = {
node.id: f"{node.id}:{ts1.population(node.population).metadata['name']}"
for node in ts1.nodes()}
SVG(ts1.draw_svg(node_labels=node_labels, y_axis=True))
After 50 generations we can see there has been a coalescence in population A
but none in B
and we therefore have three lineages left when the simulation
finishes. We can then continue the simulation based on this initial state:
ts2 = msprime.sim_ancestry(
initial_state=ts1, demography=demography, random_seed=5678)
node_labels = {
node.id: f"{node.id}:{ts2.population(node.population).metadata['name']}"
for node in ts2.nodes()}
SVG(ts2.draw_svg(node_labels=node_labels, y_axis=True))
Note that we use the same demography object, and so the lineages
migrate and ultimately coalesce in population C
, as we’d expect.
This is the simplest case, in which we already have the demography
object which we can use to model the entire simulation from end-to-end.
In other cases (e.g. when working with simulations from a different
program), we don’t have the Demography
object at hand
and we need to create one that is both compatible with the
initial_state
tree sequence and reflects the demographic
model that we are interested in using. The best way to do this
is to use the Demography.from_tree_sequence()
method to
first get a base demography that is based on the
tree sequence population table
and metadata:
demography = msprime.Demography.from_tree_sequence(ts1)
demography
id | name | description | initial_size | growth_rate | default_sampling_time | extra_metadata |
---|---|---|---|---|---|---|
0 | A | Contemporary population A | 0.0 | 0 | 0 | {} |
1 | B | Contemporary population B | 0.0 | 0 | 0 | {} |
2 | C | Ancestral population | 0.0 | 0 | 0 | {} |
We can see that the population names and descriptions have been recovered from the tree sequence metadata but no other information: the population sizes are all zero and we no longer have a population split event. If we try to run a simulation using this demography we get an error:
ts = msprime.sim_ancestry(initial_state=ts1, demography=demography)
---------------------------------------------------------------------------
InputError Traceback (most recent call last)
Cell In[65], line 1
----> 1 ts = msprime.sim_ancestry(initial_state=ts1, demography=demography)
File ~/work/tskit-site/tskit-site/msprime/ancestry.py:1290, in sim_ancestry(samples, demography, sequence_length, discrete_genome, recombination_rate, gene_conversion_rate, gene_conversion_tract_length, population_size, ploidy, model, initial_state, start_time, end_time, record_migrations, record_full_arg, additional_nodes, coalescing_segments_only, num_labels, random_seed, num_replicates, replicate_index, record_provenance)
1265 parameters = dict(
1266 command="sim_ancestry",
1267 samples=samples,
(...)
1287 # replicate index is excluded as it is inserted for each replicate
1288 )
1289 provenance_dict = provenance.get_provenance_dict(parameters)
-> 1290 sim = _parse_sim_ancestry(
1291 samples=samples,
1292 sequence_length=sequence_length,
1293 recombination_rate=recombination_rate,
1294 gene_conversion_rate=gene_conversion_rate,
1295 gene_conversion_tract_length=gene_conversion_tract_length,
1296 discrete_genome=discrete_genome,
1297 population_size=population_size,
1298 demography=demography,
1299 ploidy=ploidy,
1300 model=model,
1301 initial_state=initial_state,
1302 start_time=start_time,
1303 end_time=end_time,
1304 record_migrations=record_migrations,
1305 record_full_arg=record_full_arg,
1306 additional_nodes=additional_nodes,
1307 coalescing_segments_only=coalescing_segments_only,
1308 num_labels=num_labels,
1309 random_seed=random_seed,
1310 )
1311 return _wrap_replicates(
1312 sim,
1313 num_replicates=num_replicates,
1314 replicate_index=replicate_index,
1315 provenance_dict=provenance_dict,
1316 )
File ~/work/tskit-site/tskit-site/msprime/ancestry.py:1081, in _parse_sim_ancestry(samples, sequence_length, recombination_rate, gene_conversion_rate, gene_conversion_tract_length, discrete_genome, population_size, demography, ploidy, model, initial_state, start_time, end_time, record_migrations, record_full_arg, additional_nodes, coalescing_segments_only, num_labels, random_seed, init_for_debugger)
1078 random_seed = _parse_random_seed(random_seed)
1079 random_generator = _msprime.RandomGenerator(random_seed)
-> 1081 return Simulator(
1082 tables=initial_state,
1083 recombination_map=recombination_map,
1084 gene_conversion_map=gene_conversion_map,
1085 gene_conversion_tract_length=gene_conversion_tract_length,
1086 discrete_genome=discrete_genome,
1087 ploidy=ploidy,
1088 demography=demography,
1089 models=models,
1090 store_migrations=record_migrations,
1091 additional_nodes=additional_nodes,
1092 coalescing_segments_only=coalescing_segments_only,
1093 start_time=start_time,
1094 end_time=end_time,
1095 num_labels=num_labels,
1096 random_generator=random_generator,
1097 )
File ~/work/tskit-site/tskit-site/msprime/ancestry.py:1426, in Simulator.__init__(self, tables, recombination_map, gene_conversion_map, gene_conversion_tract_length, discrete_genome, ploidy, demography, random_generator, models, store_migrations, additional_nodes, coalescing_segments_only, start_time, end_time, num_labels)
1423 gene_conversion_rate = gene_conversion_map.rate[0]
1425 start_time = -1 if start_time is None else start_time
-> 1426 super().__init__(
1427 tables=ll_tables,
1428 recombination_map=ll_recomb_map,
1429 start_time=start_time,
1430 random_generator=random_generator,
1431 migration_matrix=demography.migration_matrix,
1432 population_configuration=ll_population_configuration,
1433 demographic_events=ll_demographic_events,
1434 store_migrations=store_migrations,
1435 additional_nodes=additional_nodes.value,
1436 coalescing_segments_only=coalescing_segments_only,
1437 num_labels=num_labels,
1438 segment_block_size=segment_block_size,
1439 avl_node_block_size=avl_node_block_size,
1440 node_mapping_block_size=node_mapping_block_size,
1441 gene_conversion_rate=gene_conversion_rate,
1442 gene_conversion_tract_length=gene_conversion_tract_length,
1443 discrete_genome=discrete_genome,
1444 ploidy=ploidy,
1445 )
1446 # Highlevel attributes used externally that have no lowlevel equivalent
1447 self.end_time = np.inf if end_time is None else end_time
InputError: Input error in initialise: Attempt to sample lineage in a population with size=0
To recover the original model we must update the Population
objects in place and add the
population split event:
demography["A"].initial_size = 100
demography["B"].initial_size = 200
demography["C"].initial_size = 300
demography.add_population_split(time=100, ancestral="C", derived=["A", "B"])
demography
id | name | description | initial_size | growth_rate | default_sampling_time | extra_metadata |
---|---|---|---|---|---|---|
0 | A | Contemporary population A | 100.0 | 0 | 0 | {} |
1 | B | Contemporary population B | 200.0 | 0 | 0 | {} |
2 | C | Ancestral population | 300.0 | 0 | 1e+02 | {} |
time | type | parameters | effect |
---|---|---|---|
100 | Population Split | derived=[A, B], ancestral=C | Moves all lineages from derived populations 'A' and 'B' to the ancestral 'C' population. Also set the derived populations to inactive, and all migration rates to and from the derived populations to zero. |
Note
See the Demography
documentation for details on how to access
Population
objects.
The demography that we obtain from Demography.from_tree_sequence()
is really just a template, which we update and elaborate as we need.
In particular, we are free to add more populations as needed.
Continuing forwards-time simulations (“recapitating”)#
The most common use for the initial_state
argument is to
start the simulation from the output of a forwards-time simulation.
Informally, we take an ‘unfinished’
tree sequence as a parameter to sim_ancestry()
, initialise the simulation
from the state of this tree sequence and then run the simulation until
coalescence. The returned tree sequence is then the result of taking the
input tree sequence and completing the trees using the coalescent.
This is useful for forwards-time simulators such as
SLiM and
fwdpy11
that can output tree sequences. By running
forward-time simulation for a certain number of generations we obtain a
tree sequence, but these trees may not have had sufficient time to
reach a most recent common ancestor. By using the initial_state
argument
to sim_ancestry()
we can combine the best of both forwards- and
backwards-time simulators. The recent past can be simulated forwards
in time and the ancient past by the coalescent. The coalescent
simulation is initialised by the root segments of the
input tree sequence, ensuring that the minimal amount of ancestral
material possible is simulated.
Any tree sequence can be provided as input to this process, but there is a specific topological requirement that must be met for the simulations to be statistically correct. To ensure that ancestral segments are correctly associated within chromosomes when constructing the initial conditions for the coalescent simulation, forward-time simulators must retain the nodes corresponding to the initial generation. Furthermore, for every sample in the final generation (i.e. the extant population at the present time) there must be a path to one of the founder population nodes.
See also
See the recapitation tutorial from the pyslim documentation for a detailed explanation of this process when using SLiM.
Models#
The ancestry model determines the model under which the ancestral history of the sample is generated. It is concerned with the basic processes of how (for example) common ancestor and recombination events occur. For example, the default “standard” or Hudson coalescent is an efficient continuous time model, and the Discrete Time Wright-Fisher model allows for more detailed (if less efficient) simulations by making fewer mathematical approximations. We can combine multiple ancestry models in a simulation using a flexible notation.
See also
See the Quick reference for a summary of the available ancestry models
Important
The ancestry model is distinct from the demographic model, which is mostly independent of the chosen ancestry model.
Specifying ancestry models#
The ancestry models used during a simulation are specified using
the model
parameter to sim_ancestry()
. Each model is a
subclass of the AncestryModel
class (see the following
subsections for the available models, and examples of their usage).
Single models#
Most of the time we want to run a simulation under a single
ancestry model. By default, we run simulations under the
StandardCoalescent
model. If we wish to run
under a different model, we use the model
argument to
sim_ancestry()
. For example, here we use the
SMC
model instead of the
standard coalescent:
ts1 = msprime.sim_ancestry(
10,
sequence_length=10,
recombination_rate=0.1,
model=msprime.SmcApproxCoalescent(),
random_seed=1234)
We specify the model as an instance of one of the ancestry model classes. For nonparametric models, there is also a useful shorthand of the model name:
ts2 = msprime.sim_ancestry(
10,
sequence_length=10,
recombination_rate=0.1,
model="smc",
random_seed=1234)
assert ts1.equals(ts2, ignore_provenance=True)
This string form is useful for single model simulations; however, when working with Multiple models it is better to use the explicit class form so that the Model duration can be set.
Model duration#
Each ancestry model instance has a duration
associated
with it. This is the maximum amount of time that this model can run for. Thus,
if we wanted to run 10 generations of the DiscreteTimeWrightFisher
model we could write:
ts = msprime.sim_ancestry(
3,
population_size=10,
model=msprime.DiscreteTimeWrightFisher(duration=10),
random_seed=1234)
SVG(ts.draw_svg(y_axis=True))
Note
Using the duration value for a single model like this is identical to
specifying an end_time
value: see the Stopping simulations early
section for more information.
It is vital to understand that the duration
value here is not
an absolute time at which the simulation must stop, but rather
the number of generations that the model should run for. To illustrate
this we can use the start_time parameter:
ts = msprime.sim_ancestry(
3,
population_size=10,
start_time=5,
model=msprime.DiscreteTimeWrightFisher(duration=10),
random_seed=1234)
SVG(ts.draw_svg(y_axis=True))
We can see that the simulation continued until 15 generations ago, because it didn’t start until 5 generations ago and we then specified that the model should run for at most 10 generations.
The model activity time is specified as a duration in this way as this is a natural way to specify Multiple models, in which some of the models may run for random periods of time.
Model completion#
Most ancestry models will run until the overall simulation has
completed; that is, when we have simulated back to the most
recent common ancestor at every position along the sequence.
For example, the StandardCoalescent
and
DiscreteTimeWrightFisher
models will simulate
until coalescence, unless we explicitly indicate that we wish
to stop the simulation early via the
model duration
or the end_time parameter.
Models such as SweepGenicSelection
are different though:
they are only defined for a limited amount of time and
may not run until coalescence (see the
Selective sweeps
for more information on the model itself).
sweep = msprime.SweepGenicSelection(
position=5,
start_frequency=0.1,
end_frequency=0.9,
s=0.25,
dt=1e-6,
)
ts = msprime.sim_ancestry(
3,
sequence_length=10,
population_size=100,
model=sweep,
random_seed=1234)
SVG(ts.draw_svg(y_axis=True))
Here, we see that even though we didn’t specify a model duration or end_time parameter the simulation has ended before we reached coalescence. In addition, the model duration is random:
ts = msprime.sim_ancestry(
3,
sequence_length=10,
population_size=100,
model=sweep,
random_seed=12345)
SVG(ts.draw_svg(y_axis=True))
Here we ran the same simulation with a different random seed, and we can see that the time the model completed was different in the two cases (30.22 vs 28.15 generations ago).
Note
Models such as SweepGenicSelection
can also complete due
to full coalescence.
Multiple models#
Sometimes we are interested in combining different models in a simulation.
To do this we provide a list of ancestry model instances as the
model
argument to sim_ancestry()
. The models
are run in the order they are specified. For each model, if its
duration
is set, then the simulation will run
for at most that additional time duration. If the duration
is
not set (or None
), then the model will run until
completion.
For example, here we run a hybrid DTWF and coalescent simulation (see the Discrete Time Wright-Fisher section for more details):
ts = msprime.sim_ancestry(
2,
population_size=1000,
model=[
msprime.DiscreteTimeWrightFisher(duration=500),
msprime.StandardCoalescent(),
],
random_seed=2
)
Warning
It is very imporant to remember to set a duration
value in the first
model here! Otherwise, the entire simulation will run under the DTWF model.
In this example we’ve run the DiscreteTimeWrightFisher
model for the first 500 generations, and then switched to the
StandardCoalescent
. Because we did not specify a
duration
in the second model, it will run until coalescence.
Note
Switching models at a fixed time point like this is equivalent to
first running the DTWF phase of the simulation with end_time=500
and then using the output as the
initial state for the simulation
of the Hudson phase. The model switching syntax here is a little
more convenient and efficient, however.
See also
See the Multiple sweeps section for an example of running many models with random durations.
Tip
The logging output of msprime can be very useful when working with multiple models. See the Logging section for more details.
Hudson coalescent#
The standard coalescent
is the default model in msprime. The algorithm
for simulating the coalescent with recombination was developed
by Hudson (1983),
and so we refer to it as the “Hudson” coalescent.
Running a simulation without specifying a model
is the
same as running with model="hudson"
:
ts1 = msprime.sim_ancestry(5, random_seed=2)
ts2 = msprime.sim_ancestry(5, model="hudson", random_seed=2)
# This is the same simulation so tree sequences are equal
assert ts1.equals(ts2, ignore_provenance=True)
Time is measured in units of generations ago. This is a continuous time model, in which the simulation moves event-by-event backwards in time (contrasting with the Discrete Time Wright-Fisher model, which works generation-by-generation). Thus, we can have fractional generations, as in this example where the MRCA of the sample occurs 10.59 “generations” ago:
ts = msprime.sim_ancestry(3, random_seed=234)
SVG(ts.draw_svg(y_axis=True))
This occurs because time is scaled to be proportional to
the population size; if we run the same simulation with a
different population size, we can see that
the time values are simply scaled up accordingly
(the default population_size
is 1):
ts = msprime.sim_ancestry(3, population_size=100, random_seed=234)
SVG(ts.draw_svg(y_axis=True))
See also
The ploidy parameter also controls the time scale; the default here is the diploid time scale where time is measured in units of 4N generations.
SMC approximations#
The SMC
and SMC'
are approximations of the continuous time
Hudson coalescent model. These were originally
motivated largely by the need to simulate coalescent processes more efficiently
than was possible using the software available at the time; however,
improved algorithms
mean that such approximations are now mostly unnecessary for simulations.
The SMC and SMC’ are however very important for inference, as the approximations have made many analytical advances possible.
Since the SMC approximations are not required for simulation efficiency, these models are implemented using a naive rejection sampling approach in msprime. The implementation is intended to facilitate the study of the SMC approximations, rather than to be used in a general-purpose way.
Discrete Time Wright-Fisher#
Msprime provides the option to perform discrete-time Wright-Fisher (DTWF)
simulations for scenarios when the coalescent model is not appropriate, including large
sample sizes, multiple chromosomes, or recent migration.
Please see the API documentation
for a formal description of the model,
and Nelson et al. 2020 for
more details and background information.
To run DTWF simulations, we need to provide the model="dtwf"
argument
to sim_ancestry()
:
ts = msprime.sim_ancestry(4, population_size=10, model="dtwf", random_seed=7)
SVG(ts.draw_svg(y_axis=True, time_scale="log_time"))
There are a few important points to note here:
All node times are integers (this is discrete time counted in generations);
We can have non-binary nodes (see node 10);
We can have simultaneous events (see nodes 8 and 9).
Most features can be used with the DTWF model, and an error will be raised if you attempt to use a feature that is not implemented for DTWF. (Please let us know if this feature is important to you!)
Important
For DTWF simulations with population structure, each row of the migration matrix must sum to one or less.
Because DTWF simulations trace the history of a sample by going backwards in time generation-by-generation rather than event-by-event like the continuous time models, it is significantly slower to simulate than the standard coalescent.
“Hybrid” simulations can be a good solution if DTWF simulations are slow. In this case we can use DTWF simulations for the recent past, and using the standard coalescent for the more distant past (where it is an excellent approximation of the discrete time model). Hybrid simulations of this type provide the best of computational efficiency and accuracy.
See also
See Nelson et al. 2020 for more details and examples where the DTWF model is more realistic than the coalescent, and an assessment of the accuracy of hybrid simulations.
For example, here we simulate with the DTWF model for 500 generations, before switching to the standard (Hudson) coalescent:
ts = msprime.sim_ancestry(
2,
population_size=1000,
model=[
msprime.DiscreteTimeWrightFisher(duration=500),
msprime.StandardCoalescent(),
],
random_seed=4
)
SVG(ts.draw_svg(y_axis=True, time_scale="log_time"))
Because of the integer node times, we can see here that most of the coalescent happened during the Wright-Fisher phase of the simulation, and as-of 500 generations in the past, there were only two lineages left. The continuous time standard coalescent model was then used to simulate the ancient past of these two lineages.
See the Specifying ancestry models section for more details on
how the use the model
argument to sim_ancestry()
to run
multiple models.
See also
See the Multiple chromosomes section for information on how to simulate multiple chromosomes using the DTWF.
Multiple merger coalescents#
Some evolutionary scenarios, such as a skewed offspring distribution
combined with a type III survivorship curve, range expansion, and
rapid adaptation, can predict genealogies with up to four simultaneous
multiple mergers. Msprime provides the option to simulate from two classes
of such genealogical processes: the BetaCoalescent
and
the DiracCoalescent
. Please see the API documentation for formal
details of the models.
For haploid organisms, both models result in genealogies in which any number of lineages can merge into a common ancestor, but only one merger event can take place at a given time. For \(p\)-ploids, up to \(2p\) simultaneous mergers can take place, corresponding to the \(2p\) available parental chromosome copies.
The diploid Beta-Xi-coalescent can be simulated as follows:
ts = msprime.sim_ancestry(
samples=5, ploidy=2, random_seed=1,
model=msprime.BetaCoalescent(alpha=1.001))
SVG(ts.draw_svg())
The specified value of \(\alpha = 1.001\) corresponds to a heavily skewed offspring distribution. Values closer to \(\alpha = 2\) result in trees whose distribution is closer to that of the standard coalescent, often featuring no multiple mergers for small sample sizes:
ts = msprime.sim_ancestry(
samples=4, ploidy=2, random_seed=1,
model=msprime.BetaCoalescent(alpha=1.8))
SVG(ts.draw_svg())
Multiple mergers still take place in a haploid simulation, but only one merger can take place at a given time:
ts = msprime.sim_ancestry(
samples=10, ploidy=1, random_seed=1,
model=msprime.BetaCoalescent(alpha=1.001))
SVG(ts.draw_svg())
A haploid simulation results in larger individual mergers than a polyploid simulation because large mergers typically get broken up into multiple simultaneous mergers in the polyploid model.
The number of generations between merger events in the Beta-coalescent depends nonlinearly on both \(\alpha\) and the population size \(N\) as detailed above. For a fixed \(\alpha\), the number of generations between common ancestor events is proportional to \(N^{\alpha - 1}\), albeit with a complicated constant of proportionality that depends on \(\alpha\). The dependence on \(\alpha\) for fixed \(N\) is not monotone. Thus, branch lengths and the number of generations until a most recent common ancestor depend on both of these parameters.
To illustrate, for \(\alpha\) close to 2 the relationship between effective population size and number of generations is almost linear:
ts = msprime.sim_ancestry(
samples=1, ploidy=2, random_seed=1, population_size=10,
model=msprime.BetaCoalescent(alpha=1.99))
tree = ts.first()
print(tree.tmrca(0,1))
ts = msprime.sim_ancestry(
samples=1, ploidy=2, random_seed=1, population_size=1000,
model=msprime.BetaCoalescent(alpha=1.99))
tree = ts.first()
print(tree.tmrca(0,1))
0.14959691919068155
14.286394871874865
For \(\alpha\) close to 1 the effective population size has little effect:
ts = msprime.sim_ancestry(
samples=1, ploidy=2, random_seed=1, population_size=10,
model=msprime.BetaCoalescent(alpha=1.1))
tree = ts.first()
print(tree.tmrca(0,1))
ts = msprime.sim_ancestry(
samples=1, ploidy=2, random_seed=1, population_size=1000,
model=msprime.BetaCoalescent(alpha=1.1))
tree = ts.first()
print(tree.tmrca(0,1))
16.311807036386615
25.85247192870844
Important
Both the haploid and polyploid timescales collapse to zero as \(\alpha\) approaches 2. For reasons of numerical stability, values above \(\alpha = 1.991\) have been prohibited. The Hudson coalescent provides a very good approximation for \(1.991 <= \alpha <= 2\).
The Dirac-coalescent is simulated similarly in both the diploid case:
ts = msprime.sim_ancestry(
samples=5, ploidy=2, random_seed=1,
model=msprime.DiracCoalescent(psi=0.9, c=10))
SVG(ts.draw_svg())
and in the haploid case:
ts = msprime.sim_ancestry(
samples=10, ploidy=1, random_seed=1,
model=msprime.DiracCoalescent(psi=0.9, c=10))
SVG(ts.draw_svg())
As with the Beta-coalescent, a haploid simulation results in larger individual mergers than a polyploid simulation because large mergers typically get broken up into multiple simultaneous mergers in the polyploid model. Larger values of the parameter \(c > 0\) result in more frequent multiple mergers, while larger values of \(0 < \psi \leq 1\) result in multiple mergers with more participating lineages. Setting either parameter to 0 would correspond to the standard coalescent.
The Dirac-coalescent is obtained as the infinite population scaling limit of Moran models, and therefore branch lengths are proportional to \(N^2\) generations, as opposed to \(N\) generations under the standard coalescent.
ts = msprime.sim_ancestry(
samples=1, ploidy=2, random_seed=1, population_size=10,
model=msprime.DiracCoalescent(psi=0.1, c=1))
tree = ts.first()
print(tree.tmrca(0,1))
ts = msprime.sim_ancestry(
samples=1, ploidy=2, random_seed=1, population_size=100,
model=msprime.DiracCoalescent(psi=0.1, c=1))
tree = ts.first()
print(tree.tmrca(0,1))
161.80418322293528
16180.418322293526
Because branch lengths are proportional to \(N^2\) generations, the number of events simulated along branches (such as mutations or recombinations) will also be proportional to \(N^2\). For example, if a tree sequence generated using a given census size \(N\) is used as input for simulating segregating sites (see Mutation simulations for details on how to do this), the expected number of segregating sites obtained will be proportional to \(N^2\).
ts = msprime.sim_ancestry(
samples=1, ploidy=2, random_seed=1, population_size=10000,recombination_rate=1e-8,
sequence_length=10**4,model=msprime.DiracCoalescent(psi=0.1, c=0))
mts = msprime.sim_mutations(ts, rate=1e-8)
print(str(mts.get_num_sites()))
6536
If the desired amount of diversity (as measured by the number of segregating sites) is proportional to \(N\), then \(N^{1/2}\) should be used for the population size under the Dirac-coalescent.
ts = msprime.sim_ancestry(
samples=1, ploidy=2, random_seed=1, population_size=100,recombination_rate=1e-8,
sequence_length=10**4,model=msprime.DiracCoalescent(psi=0.1, c=0))
mts = msprime.sim_mutations(ts, rate=1e-8)
print(str(mts.get_num_sites()))
5
Therefore the population size must be carefully chosen (and potentially rescaled) to obtain the desired simulated data in these models.
Selective sweeps#
Although the coalescent assumes that lineages are exchangable, (and therefore evolving neutrally), approximations to some forms of selective sweep (beneficial mutation moves through the population), have been derived. This is done through the use of a structured coalescent model in the spirit of Braverman et al. (1995).
In the SweepGenicSelection
model,
the population is split between two classes of selective backgrounds,
those linked to the beneficial allele, call it \(B\), and those not,
\(b\). In this model we track competing
rates of coalescence and recombination on the \(B\) and \(b\)
backgrounds.
The user supplies a final allele frequency and a starting
allele frequency, between which msprime simulates a stochastic
sweep trajectory according to a conditional diffusion model (Coop
and Griffiths, 2004;
Kim and Stephan 2002).
Beyond the start and end frequencies of the sweep trajectory, the user must specify the selection coefficient of the beneficial mutation \(s\) the selective advantage of the \(B\) homozygote over the \(b\) homozygote and \(h=0.5\). The position represents the location along the chromosome where the beneficial allele occurs. All other parameters can be set as usual.
Warning
The implementation of the structured coalescent is quite general, but there are some limitations in the current implementation of the sweeps model (e.g., no change of population size during a sweep). Please let us know if there are related features you would like to see (or if you would be interested in helping to create this functionality!)
Hard sweeps#
Todo
What’s a hard sweep? Some intro sentences here.
In this example we run some replicate hard sweep simulations
and plot the mean
pairwise diversity
in windows
across the simulated region, showing the characteristic
valley of diversity.
First we set up some basic parameters, and define the sweep model:
Ne = 1e3
L = 1e6 # Length of simulated region
num_reps = 100
# define hard sweep model
sweep_model = msprime.SweepGenicSelection(
position=L / 2, # middle of chrom
start_frequency=1.0 / (2 * Ne),
end_frequency=1.0 - (1.0 / (2 * Ne)),
s=0.25,
dt=1e-6,
)
Todo
Explain why these parameters give a hard sweep.
Next we set up the replicate simulations. As described
in the Multiple models section,
the model
parameter is a list when we want to run multiple
ancestry models. In this example, we have a sweep that occurs
in the immediate past, and is then followed by the
standard coalescent for the rest of time:
reps = msprime.sim_ancestry(
5,
model=[sweep_model, msprime.StandardCoalescent()],
population_size=Ne,
recombination_rate=1e-7,
sequence_length=L,
num_replicates=num_reps,
)
Note
Because the SweepGenicSelection
model has a random
duration we do
not set this value in the model. Please see the
Model completion section for
more discussion on this important point.
Once we’ve set up the replicate simulations we can compute the
windows for plotting, run the actual simulations
(see the Running replicate simulations section for more details)
and compute the
pairwise diversity
.
wins = np.linspace(0, L, 21)
mids = (wins[1:] + wins[:-1]) / 2
diversity = np.zeros((num_reps, mids.shape[0]))
for j, ts in enumerate(reps):
diversity[j] = ts.diversity(windows=wins, mode="branch")
Finally, we can plot the observed mean diversity across the replicates and compare it to the neutral expectation:
from matplotlib import pyplot as plt
plt.plot(mids, diversity.mean(axis=0), label="Simulations")
plt.axhline(4 * Ne, linestyle=":", label=r'Neutral expectation')
plt.ylabel(r'Branch $\pi$');
plt.xlabel('Position (bp)')
plt.legend();
Note
We use the “branch” measure of pairwise diversity which is defined in terms of the trees rather than nucleotide sequences. See Ralph et al. 2020 for more information.
As we can see, the selective sweep has reduced variation in the region most closely linked to the beneficial allele and then diversity increases with distance to each side.
Multiple sweeps#
Todo
This example is very artificial; it would be much better to illustrate with a meaningful example from the literature. Pull requests welcome!
To illustrate the idea that we can simulate large numbers of different models, in this example we create a simulation in which we have sweeps happening at random points along the genome, which are separated by a random duration of the standard coalescent.
First we set up our models:
import random
random.seed(1234)
L = 100
Ne = 1e6
models = []
for _ in range(100):
models.append(msprime.StandardCoalescent(duration=random.uniform(0, 1)))
models.append(
msprime.SweepGenicSelection(
position=random.randint(1, L - 1),
start_frequency=1 / (2 * Ne),
end_frequency=1.0 - 1 / (2 * Ne),
s=0.01,
dt=1e-6,
)
)
models.append(msprime.StandardCoalescent())
models[:5]
[StandardCoalescent(duration=0.9664535356921388),
SweepGenicSelection(duration=None, position=57, start_frequency=5e-07, end_frequency=0.9999995, s=0.01, dt=1e-06),
StandardCoalescent(duration=0.11685051774599753),
SweepGenicSelection(duration=None, position=12, start_frequency=5e-07, end_frequency=0.9999995, s=0.01, dt=1e-06),
StandardCoalescent(duration=0.9109759624491242)]
We’ve created a list of models that have random parameters using the standard Python tools. There are lots of sweeps (many more than we will actually run, in all probability), each at a random position on the genome and each separated by a random amount of time running the standard coalescent. Finally, we add an instance of the standard coalescent without a duration to guarantee that our simulation will always coalesce fully (See the Model completion section for more information.) Finally, we print out the first 5 models so we can inspect them.
Then, we can run our simulation as usual:
ts = msprime.sim_ancestry(
10,
population_size=100,
sequence_length=L,
recombination_rate=0.01,
model=models,
random_seed=6789
)
ts
Tree Sequence | |
---|---|
Trees | 100 |
Sequence Length | 100.0 |
Time Units | generations |
Sample Nodes | 20 |
Total Size | 102.7 KiB |
Metadata | No Metadata |
Table | Rows | Size | Has Metadata |
---|---|---|---|
Edges | 1588 | 49.6 KiB | |
Individuals | 10 | 304 Bytes | |
Migrations | 0 | 8 Bytes | |
Mutations | 0 | 16 Bytes | |
Nodes | 523 | 14.3 KiB | |
Populations | 1 | 224 Bytes | ✅ |
Provenances | 1 | 25.8 KiB | |
Sites | 0 | 16 Bytes |
Tip
The logging output of msprime can be very useful when working with multiple models. See the Logging section for more details.
Fixed pedigree#
A pedigree describes parent-offspring relationships between individuals, and
these relationships constrain how genomes are inherited. We can simulate
ancestry conditioned on a fixed pedigree using the FixedPedigree
ancestry model. In this model the transfer of genetic material from parent to
offspring is determined by the pedigree. Recombination occurs randomly such that
each of a diploid individual’s two genomes is a mixture of two corresponding
parental genomes, but everything else is determined by the pedigree.
Warning
It is important to note that there is no concept of a population genetic model “outside” of the pedigree in this model, and there are significant caveats when dealing with incomplete pedigree information and the treatment of founders. Please see the Completing fixed pedigree simulations section for more information.
Simple example#
First we define a small pedigree, which follows a sample of three individuals through four generations:
See also
See the Pedigrees section for details on this pedigree text format and information on how pedigrees are encoded in msprime via the individual and node tables.
import io
ped_txt = """\
# id parent0 parent1 time is_sample
0 5 4 0.0 1
1 3 3 0.0 1
2 3 4 0.0 1
3 7 7 1.0 0
4 8 8 1.0 0
5 8 6 1.0 0
6 9 10 2.0 0
7 10 10 2.0 0
8 9 9 2.0 0
9 11 12 3.0 0
10 11 12 3.0 0
11 . . 4.0 0
12 . . 4.0 0
"""
pedigree = msprime.parse_pedigree(io.StringIO(ped_txt), sequence_length=100)
draw_pedigree(pedigree.tree_sequence())
Note
See the Visualising pedigrees section for a
definition of the draw_pedigree
function used here.
We then use this pedigree information as the
initial state for the simulation
(note that we set the sequence length on the pedigree tables
when we called parse_pedigree()
):
ped_ts = msprime.sim_ancestry(
initial_state=pedigree, model="fixed_pedigree", random_seed=41,
recombination_rate=0.001)
node_labels = {node.id: f"{node.individual}({node.id})" for node in ped_ts.nodes()}
SVG(ped_ts.draw_svg(y_axis=True, node_labels=node_labels, size=(600,200)))
The output tree sequence contains the simulated genetic ancestry conditioned on this input pedigree, as shown in the trees above. Each node is labelled with it the individual ID it’s associated with, and its ID (as individual(node)). The simulation stops at time 4, and we are left with multiple roots, which correspond to the founder individuals. If you only need simulations of the genetic ancestry through the pedigree, then these trees can be used perfectly well in downstream applications (such as Mutation simulations). See the Completing fixed pedigree simulations section for information (and caveats about) completing these simulations.
Note
Because the input pedigree fully describes the simulation many features such as
demography are not available when we use the
FixedPedigree
model. The FixedPedigree
model also cannot be
combined with other ancestry models.
Censoring pedigree information#
Warning
The output tree sequence also contains the full pedigree information provided as input. This may be important if you are performing simulations on pedigrees that are not freely available.
It is straightforward to remove the pedigree information from an output tree sequence, if required:
tables = ped_ts.dump_tables()
tables.individuals.clear()
tables.nodes.individual = np.full_like(tables.nodes.individual, tskit.NULL)
censored_ts = tables.tree_sequence()
censored_ts
Tree Sequence | |
---|---|
Trees | 2 |
Sequence Length | 100.0 |
Time Units | generations |
Sample Nodes | 6 |
Total Size | 2.6 KiB |
Metadata | No Metadata |
Table | Rows | Size | Has Metadata |
---|---|---|---|
Edges | 13 | 424 Bytes | |
Individuals | 0 | 40 Bytes | |
Migrations | 0 | 8 Bytes | |
Mutations | 0 | 16 Bytes | |
Nodes | 26 | 736 Bytes | |
Populations | 1 | 224 Bytes | ✅ |
Provenances | 1 | 1.0 KiB | |
Sites | 0 | 16 Bytes |
Completing fixed pedigree simulations#
If you wish to simulate ancestry older than what is defined in the pedigree, you can “complete” (or “recapitate”) the within-pedigree simulation using the standard methods described in the Specifying the initial state section.
Continuing the example in the previous section, we provide the simulated
ancestry in ped_ts
as the initial state for a DiscreteTimeWrightFisher
simulation with a population size of 10:
final_ts = msprime.sim_ancestry(
initial_state=ped_ts, model="dtwf", population_size=10, random_seed=42)
SVG(final_ts.draw_svg(y_axis=True, size=(600,200)))
We can see that the nodes 22 and 23 representing the founder genomes from the input pedigree are used as the starting point for the DTWF simulation, which then finishes at generation 6 when the MRCA (node 26) is found.
Missing pedigree data#
The previous example is very straightforward because the pedigree is
complete and we can trace all individuals back to the same
generation in the pedigree. The pedigrees we have in practice
are usually not like this, and there is a great deal of missing data.
The consequence of this for the FixedPedigree
simulations
is that genetic material will often be at a “dead end” at many
levels of the pedigree during a simulation.
ped_txt = """\
# id parent0 parent1 time is_sample
0 5 4 0.0 1
1 3 3 0.0 1
2 3 4 0.0 1
3 7 7 1.0 0
4 8 8 1.0 0
5 8 6 1.0 0
6 9 10 2.0 0
7 . . 2.0 0
8 9 9 2.0 0
9 11 12 3.0 0
10 11 12 3.0 0
11 . . 4.0 0
12 . . 4.0 0
"""
pedigree = msprime.parse_pedigree(io.StringIO(ped_txt), sequence_length=100)
draw_pedigree(pedigree.tree_sequence())
This pedigree has a “dead end” for individual 7, which is
reflected in the trees that we output from the FixedPedigree
simulation:
ped_ts = msprime.sim_ancestry(
initial_state=pedigree, model="fixed_pedigree", random_seed=41,
recombination_rate=0.001)
node_labels = {node.id: f"{node.individual}({node.id})" for node in ped_ts.nodes()}
SVG(ped_ts.draw_svg(y_axis=True, node_labels=node_labels, size=(600,200)))
When we provide this tree sequence as the initial state of another simulation, we then effectively start the simulation at the earlier time point corresponding to individual 7. Ancestral material in older parts of the pedigree is then brought into the simulation in the same manner as ancient samples as we proceed backwards in time.
Warning
It is important to note that although ancestral material for pedigree founders is introduced into the recapitation simulation at the correct time, no account is taken of the number of lineages present in the pedigree when calculating population sizes. Thus, the pedigree must be seen as entirely external to the population model simulated during recapitation.
Pedigrees and demography#
For complex simulations in which our pedigree individuals (and founders, in particular) are drawn from different populations, we will want our follow-up recapitation simulations to use a demographic model. In order to do this, we must define the demographic model before creating the input pedigree. For example, consider the following simple demography:
demography = msprime.Demography()
demography.add_population(name="A", initial_size=10)
demography.add_population(name="B", initial_size=20)
demography.add_population(name="C", initial_size=100)
demography.add_population_split(time=10, derived=["A", "B"], ancestral="C");
We define a pedigree in which the individuals belong to the two “leaf” populations, A and B:
ped_txt = """\
# id parent0 parent1 time is_sample population
0 9 11 0.0 1 A
1 10 10 0.0 1 A
2 6 7 0.0 1 A
3 7 7 0.0 1 A
4 8 9 0.0 1 A
5 10 10 0.0 1 A
6 14 14 1.0 0 A
7 12 12 1.0 0 A
8 16 17 1.0 0 A
9 12 13 1.0 0 A
10 14 12 1.0 0 A
11 15 16 1.0 0 A
12 . . 2.0 0 A
13 . . 2.0 0 A
14 . . 2.0 0 A
15 . . 2.0 0 A
16 . . 2.0 0 A
17 . . 2.0 0 A
18 23 22 0.0 1 B
19 22 22 0.0 1 B
20 21 22 0.0 1 B
21 25 25 1.0 0 B
22 24 25 1.0 0 B
23 24 24 1.0 0 B
24 27 26 2.0 0 B
25 27 26 2.0 0 B
26 30 28 3.0 0 B
27 30 29 3.0 0 B
28 31 32 4.0 0 B
29 31 32 4.0 0 B
30 32 31 4.0 0 B
31 . . 5.0 0 B
32 . . 5.0 0 B
"""
pedigree = msprime.parse_pedigree(
io.StringIO(ped_txt), demography=demography, sequence_length=100)
draw_pedigree(pedigree.tree_sequence())
We can see that the pedigree is split into two sub-pedigrees for the
two populations, and the founders for the two populations exist
at different times (2 and 5 generations ago for A and B, respectively).
Running the FixedPedigree
simulation, we can see that the
trees reflect this structure:
ped_ts = msprime.sim_ancestry(
initial_state=pedigree, model="fixed_pedigree", random_seed=41)
node_labels = {node.id: f"{node.individual}({node.id})" for node in ped_ts.nodes()}
SVG(ped_ts.draw_svg(y_axis=True, node_labels=node_labels, size=(600,200)))
When we complete this simulation we must use a Demography
object
that matches the populations defined in the original pedigree tables. The
simplest way to do this is define the demography first, and use the same
object throughout as we have done here:
full_ts = msprime.sim_ancestry(
initial_state=ped_ts, demography=demography, random_seed=42, model="dtwf")
node_labels = {
node.id: f"{node.id}:{full_ts.population(node.population).metadata['name']}"
for node in full_ts.nodes()}
SVG(full_ts.draw_svg(
y_axis=True, node_labels=node_labels, size=(600,300), time_scale="rank"))
Here we have changed the node labelling in the trees to show the name of the population of each node. We can see that all of the early nodes are in the populations defined from the pedigree. When the recapitation simulation begins, all nodes are assigned into the relevant populations and the simulation proceeds from there under the defined demography. In this demographic model, populations A and B split from C 10 generations ago, and so the remaining lineages in populations A and B evolve independently until this time. We can see that older coalescences all happen within population C, as stipulated by the demographic model.
Warning
The same caveats that we noted above about missing data in the pedigree apply: although the remaining ancestral lineages will join the simulation at the correct times and in the specified population, no account is taken of the number of extant lineages from the pedigree when computing population sizes.
Missing data#
Todo
Give an example of doing a simulation with unknown rates on the flanks and show that the resulting tree sequence has missing data in these regions.
Common errors#
Infinite waiting time#
Todo
explain this, why it happens and give examples of when we don’t detect it. Mention the possible_lineage_locations method.
demography = msprime.Demography()
demography.add_population(name="A", initial_size=1)
demography.add_population(name="B", initial_size=1)
msprime.sim_ancestry(samples={"A": 1, "B": 1}, demography=demography)
---------------------------------------------------------------------------
LibraryError Traceback (most recent call last)
Cell In[105], line 4
2 demography.add_population(name="A", initial_size=1)
3 demography.add_population(name="B", initial_size=1)
----> 4 msprime.sim_ancestry(samples={"A": 1, "B": 1}, demography=demography)
File ~/work/tskit-site/tskit-site/msprime/ancestry.py:1311, in sim_ancestry(samples, demography, sequence_length, discrete_genome, recombination_rate, gene_conversion_rate, gene_conversion_tract_length, population_size, ploidy, model, initial_state, start_time, end_time, record_migrations, record_full_arg, additional_nodes, coalescing_segments_only, num_labels, random_seed, num_replicates, replicate_index, record_provenance)
1289 provenance_dict = provenance.get_provenance_dict(parameters)
1290 sim = _parse_sim_ancestry(
1291 samples=samples,
1292 sequence_length=sequence_length,
(...)
1309 random_seed=random_seed,
1310 )
-> 1311 return _wrap_replicates(
1312 sim,
1313 num_replicates=num_replicates,
1314 replicate_index=replicate_index,
1315 provenance_dict=provenance_dict,
1316 )
File ~/work/tskit-site/tskit-site/msprime/ancestry.py:681, in _wrap_replicates(simulator, num_replicates, replicate_index, provenance_dict, mutation_rate)
675 iterator = simulator.run_replicates(
676 num_replicates,
677 mutation_rate=mutation_rate,
678 provenance_dict=provenance_dict,
679 )
680 if replicate_index is not None:
--> 681 deque = collections.deque(iterator, maxlen=1)
682 return deque.pop()
683 else:
File ~/work/tskit-site/tskit-site/msprime/ancestry.py:1606, in Simulator.run_replicates(self, num_replicates, mutation_rate, provenance_dict)
1604 for replicate_index in range(num_replicates):
1605 logger.info("Starting replicate %d", replicate_index)
-> 1606 self.run()
1607 if mutation_rate is not None:
1608 # This is only called from simulate() or the ms interface,
1609 # so does not need any further parameters.
1610 mutations._simple_mutate(
1611 tables=self.tables,
1612 random_generator=self.random_generator,
1613 sequence_length=self.sequence_length,
1614 rate=mutation_rate,
1615 )
File ~/work/tskit-site/tskit-site/msprime/ancestry.py:1571, in Simulator.run(self, event_chunk, debug_func)
1569 raise ValueError("Model durations must be >= 0")
1570 end_time = min(self.time + model_duration, self.end_time)
-> 1571 exit_reason = self._run_until(end_time, event_chunk, debug_func)
1572 if exit_reason == ExitReason.COALESCENCE or self.time == self.end_time:
1573 logger.debug("Skipping remaining %d models", len(self.models) - j - 1)
File ~/work/tskit-site/tskit-site/msprime/ancestry.py:1536, in Simulator._run_until(self, end_time, event_chunk, debug_func)
1534 ret = ExitReason.MAX_EVENTS
1535 while ret == ExitReason.MAX_EVENTS:
-> 1536 ret = ExitReason(super().run(end_time, event_chunk))
1537 if self.time > end_time:
1538 # Currently the Pedigree and Sweeps models are "non-reentrant"
1539 # We can change this to an assertion once these have been fixed.
1540 raise RuntimeError(
1541 f"Model {self.model['name']} does not support interruption. "
1542 "Please open an issue on GitHub"
1543 )
LibraryError: Infinite waiting time until next simulation event.