Ancestry simulations#

Msprime simulates the ancestry of sampled genomes under a number of different backwards-in-time population genetic models. The simulated ancestry is represented as a succinct tree sequence using the tskit library, which provides an extensive suite of operations for analysing genealogical trees and DNA sequence data. See the Getting started with tskit tutorial for more information on how to process data using tskit.

Here we run a simple simulation and show a summary of the resulting tree sequence:

ts = msprime.sim_ancestry(2)
ts
Tree Sequence
Trees1
Sequence Length1.0
Time Unitsgenerations
Sample Nodes4
Total Size1.8 KiB
MetadataNo Metadata
Table Rows Size Has Metadata
Edges 6 200 Bytes
Individuals 2 80 Bytes
Migrations 0 8 Bytes
Mutations 0 16 Bytes
Nodes 7 204 Bytes
Populations 1 224 Bytes
Provenances 1 1018 Bytes
Sites 0 16 Bytes

The sim_ancestry() function has many parameters to specify the details of the simulations. The API documentation contains a precise description of the parameter; the sections in this page provide some explanation and examples of how to use these parameters.

Note

It’s important to note that sim_ancestry() only simulates the ancestral trees for samples: if we want actual sequence data then we also need to simulate mutations on these trees. See the Mutation simulations section for more information on how to do this.


Quick reference#

sim_ancestry()

Simulate ancestral tree sequence topology

Models

StandardCoalescent

Coalescent with recombination (“hudson”)

SmcApproxCoalescent

Sequentially Markov Coalescent (“smc”)

SmcPrimeApproxCoalescent

SMC’(“smc_prime”)

DiscreteTimeWrightFisher

Generation-by-generation Wright-Fisher

FixedPedigree

Simulated conditioned on a given pedigree

BetaCoalescent

Beta coalescent multiple-merger

DiracCoalescent

Dirac coalescent multiple-merger

SweepGenicSelection

Selective sweep at a linked locus


Specifying samples#

The samples argument to sim_ancestry() defines the number of sampled individuals whose history will be simulated. There are three different forms; the samples argument can be:

  • an integer, interpreted as the number of sampled individuals to draw in a single population model;

  • a dictionary mapping population references (either integer IDs or names) to the number of sampled individuals for that population;

  • a list of SampleSet objects, which provide more flexibility in how groups of similar samples are drawn from populations.

In the simplest case, provide a single integer which defines the number of samples:

ts = msprime.sim_ancestry(2, random_seed=1234)
ts
Tree Sequence
Trees1
Sequence Length1.0
Time Unitsgenerations
Sample Nodes4
Total Size1.8 KiB
MetadataNo Metadata
Table Rows Size Has Metadata
Edges 6 200 Bytes
Individuals 2 80 Bytes
Migrations 0 8 Bytes
Mutations 0 16 Bytes
Nodes 7 204 Bytes
Populations 1 224 Bytes
Provenances 1 1012 Bytes
Sites 0 16 Bytes

Here we specify two sample individuals, which at the default ploidy of two, gives us four sample nodes.

Warning

It is important to note that the number of samples refers to the number of individuals not the number of nodes (monoploid genomes). See the Ploidy section for details.

This integer form for the samples argument can only be used when the demography is a single population model.

Populations#

The next example illustrates one usage of the dictionary form of the samples argument, in which the keys refer to populations in the demographic model and the values are the number of samples to draw. (See the Demography section for more details on the demography argument to sim_ancestry().) We first create a Demography object representing a 10 deme linear stepping stone model. Then, we run the simulation with 3 diploid samples each drawn from the first and last demes in this linear habitat.

N = 10
demography = msprime.Demography.stepping_stone_model(
    [100] * N,
    migration_rate=0.1,
    boundaries=True)
ts = msprime.sim_ancestry({0: 3, N - 1: 3}, demography=demography)
ts
Tree Sequence
Trees1
Sequence Length1.0
Time Unitsgenerations
Sample Nodes12
Total Size5.6 KiB
MetadataNo Metadata
Table Rows Size Has Metadata
Edges 22 712 Bytes
Individuals 6 192 Bytes
Migrations 0 8 Bytes
Mutations 0 16 Bytes
Nodes 23 652 Bytes
Populations 10 593 Bytes
Provenances 1 3.3 KiB
Sites 0 16 Bytes

The keys in the dictionary can also be the string names of the population (see the Identifying populations section), which is useful when we are simulating from empirically estimated models. For example, here create a Demography object based on a species tree, and then draw samples using the species names.

demography = msprime.Demography.from_species_tree(
    "(((human:5.6,chimpanzee:5.6):3.0,gorilla:8.6):9.4,orangutan:18.0)",
    time_units="myr",
    initial_size=10**4,
    generation_time=20)
ts = msprime.sim_ancestry({"gorilla": 2, "human": 4}, demography=demography)
ts
Tree Sequence
Trees1
Sequence Length1.0
Time Unitsgenerations
Sample Nodes12
Total Size5.0 KiB
MetadataNo Metadata
Table Rows Size Has Metadata
Edges 22 712 Bytes
Individuals 6 192 Bytes
Migrations 0 8 Bytes
Mutations 0 16 Bytes
Nodes 23 652 Bytes
Populations 7 481 Bytes
Provenances 1 2.8 KiB
Sites 0 16 Bytes

When samples are drawn using this dictionary form it is always at the population’s Default sampling time. If you wish to draw samples at a different time, then you must use the more general form of a list of SampleSet objects.

See also

See the Demography section for more information on how to specify demographic models in a simulation.

Sampling time#

By default the samples that we draw from a Population are at the population’s default_sampling_time. See the Default sampling time section for more information about how this is set for populations in a demographic model.

We can manually control the time at which samples are drawn using list of SampleSet objects form for the samples argument.

In this example we create two diploid sample individuals, one at the present time and one taken 50 generations ago, representing one modern and one ancient individual. By default, when we draw the trees, the nodes that belong to samples are drawn as squares whereas nodes that are not samples are drawn as circles.

ts = msprime.sim_ancestry(
    samples = [
        msprime.SampleSet(1),
        msprime.SampleSet(1, time=50)
    ],
    population_size=100,
    random_seed=42
)
SVG(ts.draw_svg(y_axis=True))
_images/91b787063b98d7bb80d819d1fb202b5f3a02056a7ba56f6f4b91b82448880bb2.svg

See also

See the Order of event execution section for details of the relative order in which sampling events occur.

Sample details#

Sample individuals and nodes are allocated sequentially in the order that they are specified. For example:

demography = msprime.Demography.island_model([10, 10], migration_rate=1)
ts = msprime.sim_ancestry(
    samples=[
        msprime.SampleSet(1, population=1),
        msprime.SampleSet(2, population=0)],
    demography=demography,
    end_time=0)
print(ts.tables.individuals)
print(ts.tables.nodes)
╔══╤═════╤════════╤═══════╤════════╗
║id│flags│location│parents│metadata║
╠══╪═════╪════════╪═══════╪════════╣
║0 │    0│        │       │        ║
║1 │    0│        │       │        ║
║2 │    0│        │       │        ║
╚══╧═════╧════════╧═══════╧════════╝

╔══╤═════╤══════════╤══════════╤════╤════════╗
║id│flags│population│individual│time│metadata║
╠══╪═════╪══════════╪══════════╪════╪════════╣
║0 │    1│         1│         0│   0│        ║
║1 │    1│         1│         0│   0│        ║
║2 │    1│         0│         1│   0│        ║
║3 │    1│         0│         1│   0│        ║
║4 │    1│         0│         2│   0│        ║
║5 │    1│         0│         2│   0│        ║
╚══╧═════╧══════════╧══════════╧════╧════════╝

(Because we’re only interested in the sampled nodes and individuals we stopped the simulation from actually doing anything by setting end_time=0 — see the Setting the start time section for more information.) Here we define three sample individuals, and we therefore have three rows in the individual table. Because these are diploid individuals, the node table contains six sample nodes. If we look at the individual column in the node table we can see that the first two nodes correspond to individual 0, the next two nodes individual 1, etc. The sample configuration stated that the first sample should come from population 1 and the other two from population 0, and we can see this reflected in the population column of the node table. (Somewhat confusingly, population values are associated with nodes rather than individuals; this is mostly for historical reasons.)

The SampleSet class has a number of attributes which default to None. If these are set they will override the values which might be specified elsewhere. For example, we can specify mixed ploidy samples via the ploidy attribute of SampleSet:

ts = msprime.sim_ancestry(
    samples=[
        msprime.SampleSet(1, ploidy=3),
        msprime.SampleSet(2)],
    ploidy=1,
    end_time=0)
print(ts.tables.individuals)
print(ts.tables.nodes)
╔══╤═════╤════════╤═══════╤════════╗
║id│flags│location│parents│metadata║
╠══╪═════╪════════╪═══════╪════════╣
║0 │    0│        │       │        ║
║1 │    0│        │       │        ║
║2 │    0│        │       │        ║
╚══╧═════╧════════╧═══════╧════════╝

╔══╤═════╤══════════╤══════════╤════╤════════╗
║id│flags│population│individual│time│metadata║
╠══╪═════╪══════════╪══════════╪════╪════════╣
║0 │    1│         0│         0│   0│        ║
║1 │    1│         0│         0│   0│        ║
║2 │    1│         0│         0│   0│        ║
║3 │    1│         0│         1│   0│        ║
║4 │    1│         0│         2│   0│        ║
╚══╧═════╧══════════╧══════════╧════╧════════╝

(Again, we stop the simulation immediately because we’re only interested in the initial samples.) Here we have three sampled individuals again but in this they are mixed ploidy: the first individual is triploid and the other two are haploid.

Warning

It is vital to note that setting the ploidy value of SampleSet objects only affects sampling and does not affect the actual simulation. In this case, the simulation will be run on the haploid time scale. See the Ploidy section for more details.

If you wish to set up the node and individual IDs in some other way, it is possible to do so by using the initial_state parameter. See the Specifying the initial state for more information on how to use this (advanced) feature.

Demography#

A demographic model is a description of a set of populations, the migration rates between then and the events that modify the populations over time. Ancestry simulations can take one of these models as input via the demography parameter to sim_ancestry().

See also

Please see the Demographic models section for details on how to create and debug demographic models, and the Specifying samples section for help on how to define the locations and times of sampled individuals.

We don’t need to explicitly specify a Demography for every ancestry simulation, however. By default, we assume a single fixed-size population. For the default standard coalescent ancestry model we assume a population size of 1, so that time is measured in “coalescent units”. This is useful for theoretical work where we usually work in scaled coalescent time.

Important

By default msprime uses the diploid coalescent time scale; see the Ploidy section for more details.

Outside of theoretical analysis, working with scaled coalescent time can be confusing and it’s usually better to explicitly set the population size so that the output times are directly interpretable in generations. The population_size parameter to sim_ancestry() allows us to set the size of the single constant sized population in the default demography.

For example, here we run the same simulation twice, one with the default population size and one with a population size of 100 (this is the diploid population size because the default Ploidy is 2):

ts = msprime.sim_ancestry(2, random_seed=1234)
SVG(ts.draw_svg(y_axis=True))
_images/3073f40af0467850c4f2006fbad70613c396bbd8eeae384cdd7ae9b21a27b558.svg
ts = msprime.sim_ancestry(2, population_size=100, random_seed=1234)
SVG(ts.draw_svg(y_axis=True))
_images/c13d5f56f1cf9823556e342d1a0efe138f205272a68781d6f897a9a0d84a98f1.svg

We can see that the results are identical, except the node times have been scaled by the population size.

The Discrete Time Wright-Fisher model is different. Because it explicitly works with a population of N individuals we get an error if we don’t specify a population size:

msprime.sim_ancestry(2, model="dtwf", random_seed=1234)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[11], line 1
----> 1 msprime.sim_ancestry(2, model="dtwf", random_seed=1234)

File ~/work/tskit-site/tskit-site/msprime/ancestry.py:1290, in sim_ancestry(samples, demography, sequence_length, discrete_genome, recombination_rate, gene_conversion_rate, gene_conversion_tract_length, population_size, ploidy, model, initial_state, start_time, end_time, record_migrations, record_full_arg, additional_nodes, coalescing_segments_only, num_labels, random_seed, num_replicates, replicate_index, record_provenance)
   1265     parameters = dict(
   1266         command="sim_ancestry",
   1267         samples=samples,
   (...)
   1287         # replicate index is excluded as it is inserted for each replicate
   1288     )
   1289     provenance_dict = provenance.get_provenance_dict(parameters)
-> 1290 sim = _parse_sim_ancestry(
   1291     samples=samples,
   1292     sequence_length=sequence_length,
   1293     recombination_rate=recombination_rate,
   1294     gene_conversion_rate=gene_conversion_rate,
   1295     gene_conversion_tract_length=gene_conversion_tract_length,
   1296     discrete_genome=discrete_genome,
   1297     population_size=population_size,
   1298     demography=demography,
   1299     ploidy=ploidy,
   1300     model=model,
   1301     initial_state=initial_state,
   1302     start_time=start_time,
   1303     end_time=end_time,
   1304     record_migrations=record_migrations,
   1305     record_full_arg=record_full_arg,
   1306     additional_nodes=additional_nodes,
   1307     coalescing_segments_only=coalescing_segments_only,
   1308     num_labels=num_labels,
   1309     random_seed=random_seed,
   1310 )
   1311 return _wrap_replicates(
   1312     sim,
   1313     num_replicates=num_replicates,
   1314     replicate_index=replicate_index,
   1315     provenance_dict=provenance_dict,
   1316 )

File ~/work/tskit-site/tskit-site/msprime/ancestry.py:1009, in _parse_sim_ancestry(samples, sequence_length, recombination_rate, gene_conversion_rate, gene_conversion_tract_length, discrete_genome, population_size, demography, ploidy, model, initial_state, start_time, end_time, record_migrations, record_full_arg, additional_nodes, coalescing_segments_only, num_labels, random_seed, init_for_debugger)
   1005 if is_dtwf:
   1006     # A default size of 1 isn't so smart for DTWF and almost certainly
   1007     # an error.
   1008     if population_size is None:
-> 1009         raise ValueError(
   1010             "When using the DTWF model, the population size must be set "
   1011             "explicitly, either using the population_size or demography "
   1012             "arguments."
   1013         )
   1014 if initial_state is not None:
   1015     if population_size is None:

ValueError: When using the DTWF model, the population size must be set explicitly, either using the population_size or demography arguments.

Note

Because the DTWF models populations of individuals rather than using continuous time approximations, simulations with different population sizes will yield entirely different results; we are not simply rescaling time by using different population sizes. Furthermore, simulations with large population sizes will take relatively more CPU time and memory.


Ploidy#

The ploidy argument to sim_ancestry() has two effects:

  1. It sets the default ploidy, that is, the number of nodes per sample individual.

  2. For continuous time coalescent models, it sets the time scale.

Both of these are somewhat confusing, so let’s look at them one at a time.

Default number of nodes per sample individual#

As discussed in the Sample details section, the number of nodes (i.e., monoploid genomes) that we allocate per sample individual is determined by the ploidy. Here, we specify 2 sample individuals with a ploidy of 3:

ts = msprime.sim_ancestry(samples=2, ploidy=3, end_time=0)
print(ts.tables.individuals)
print(ts.tables.nodes)
╔══╤═════╤════════╤═══════╤════════╗
║id│flags│location│parents│metadata║
╠══╪═════╪════════╪═══════╪════════╣
║0 │    0│        │       │        ║
║1 │    0│        │       │        ║
╚══╧═════╧════════╧═══════╧════════╝

╔══╤═════╤══════════╤══════════╤════╤════════╗
║id│flags│population│individual│time│metadata║
╠══╪═════╪══════════╪══════════╪════╪════════╣
║0 │    1│         0│         0│   0│        ║
║1 │    1│         0│         0│   0│        ║
║2 │    1│         0│         0│   0│        ║
║3 │    1│         0│         1│   0│        ║
║4 │    1│         0│         1│   0│        ║
║5 │    1│         0│         1│   0│        ║
╚══╧═════╧══════════╧══════════╧════╧════════╝

Note

Because we’re only interested in the initial conditions here we set the end_time to zero: see the Setting the start time section for more details.

Our individual table has two rows (the number of samples) and node table has 6 rows: three for each of the sampled individuals.

Note that this is the default, and can be overridden by the SampleSet.ploidy value, as described in the Sample details section.

Coalescent time scales#

The ploidy argument also affects the time scale on which coalescence occurs, since it affects the total number of genomes in the population (see also population size). For example, consider the following two simulations, which are identical except for different ploidy values (note we use the SampleSet object to make sure we allocate the same number of nodes, as discussed in the Default number of nodes per sample individual section):

ts = msprime.sim_ancestry(
    samples=[msprime.SampleSet(4, ploidy=1)],
    population_size=1,
    ploidy=1,
    random_seed=1234
)
SVG(ts.draw_svg(y_axis=True))
_images/a8deb3d001de8a2923285df52ab3b49b8fda363032c8ca2d2c8a8a40d67d4413.svg
ts = msprime.sim_ancestry(
    samples=[msprime.SampleSet(4, ploidy=1)],
    population_size=1,
    ploidy=2,
    random_seed=1234
)
SVG(ts.draw_svg(y_axis=True))
_images/ee03351ca5d012b8344d6a0063aedc2dbb5ee54265ea1c0bc954d052aecefd4b.svg

The resulting tree sequences are identical, except that node time differ by a factor of 2, because in the underlying coalescent algorithm the total number of genomes in the population acts as a time scaling.

Todo

Add a discussion here about the Lambda coalescents an how things are more complicated for them.

Important

The Discrete Time Wright-Fisher model is an explicitly diploid model and it is therefore an error to run it with any ploidy value other than 2.

Genome properties#

Sequence length#

There are a number of ways to specify the length of the chromosome that we want to simulate. In the absence of recombination and gene conversion, we assume a genome of length 1:

ts = msprime.sim_ancestry(2, random_seed=1)
SVG(ts.draw_svg())
_images/72b253cb7e0e1e17186d87b5f5ca481d4e0ada890a8b97cf7f7805a1fd790dd8.svg

The sequence_length parameter determines the length of the simulated sequence:

ts = msprime.sim_ancestry(2, sequence_length=100, random_seed=1)
SVG(ts.draw_svg())
_images/71e574ece6fa8b4d2a6f7a215d0da7e5ae299964931c4a97db053a92668c6426.svg

By default there is no recombination or gene conversion, and so we therefore still have only one tree defining the ancestry of the samples. The sequence length makes little difference if we are considering only the ancestry of a non-recombining sequence, however, this will be important if we are interested in adding mutations later.

Recombination#

As we trace the ancestry of a sample of genomes backwards in time, recombination has the effect of creating segments of genome that have different (but highly correlated) genealogical trees. The sim_ancestry() function has a recombination_rate argument that allows us to flexibly specify both uniform and non-uniform rates of recombination along the genome.

The simplest way to run simulations with recombination is to use the recombination_rate and sequence_length parameters:

ts = msprime.sim_ancestry(
    3, recombination_rate=0.1, sequence_length=10, random_seed=2)
SVG(ts.draw_svg())
_images/f69d27abed270fb36dffa44791dcf9c7e7571da1d2708c05bb57affa5172f87a.svg

Here, we have four distinct trees along the genome, created by recombination events.

We can also provide a RateMap argument to recombination_rate:

rate_map = msprime.RateMap.uniform(sequence_length=10, rate=0.1)
other_ts = msprime.sim_ancestry(3, recombination_rate=rate_map, random_seed=2)
assert ts.equals(other_ts, ignore_provenance=True)

There are two things to note here:

  1. We didn’t need to specify a sequence_length argument to sim_ancestry(), as this information is encapsulated by the rate_map instance. (We can provide a value for sequence_length if we wish, but it must be equal to the sequence_length property of the RateMap).

  2. The result of the simulation for a Uniform rate maps rate map with the same rate and sequence_length is identical to the first simulation above.

We can simulate non-uniform recombination by defining our RateMap appropriately. For example, here we define a map in which the rate over the first half of the sequence is one tenth of the second half:

rate_map = msprime.RateMap(position=[0, 10, 20], rate=[0.01, 0.1])
ts = msprime.sim_ancestry(3, recombination_rate=rate_map, random_seed=2)
SVG(ts.draw_svg())
_images/f2b3bab632ebb05954382d982094949c53e9fa82031d511d73342d3e4e5f5c6f.svg

Because the recombination rate in from position 10 to 20 is so much higher than the rate from 0 to 10, we can see that all the recombinations have fallen in the high recombination region.

See also

Gene conversion#

Gene conversion events are defined by two parameters: the rate at which gene conversion events are initiated and the distribution of tract lengths. In the default case of discrete genome coordinates, tract lengths are drawn from a geometric distribution with mean gene_conversion_tract_length (which must be at least 1). Note that if we specify a tract length of 1, then all gene conversion tracts will have length exactly 1. In the following example one gene conversion event of length 1 has occurred:

ts = msprime.sim_ancestry(
    3, gene_conversion_rate=0.02, gene_conversion_tract_length=1,
    sequence_length=10, random_seed=3)
SVG(ts.draw_svg())
_images/594551ab5793662ae1acf51593b79289332c0fefb3e3a0283827e60b8fc736ee.svg

Note

We can see the specific effect of gene conversion here in this simple example, because the trees at either side of the event are identical. However, this will not be true in general.

Continuous genomes can also be used. In this case the parameters define the rate at which gene conversion events are initiated per unit of sequence length and the mean of the exponentially distributed gene conversion tract lengths. The following example shows the same simulation as above but for a continuous genome:

ts = msprime.sim_ancestry(
    3, gene_conversion_rate=0.02, gene_conversion_tract_length=1,
    sequence_length=10, random_seed=3, discrete_genome=False)
SVG(ts.draw_svg())
_images/30585c2bc8d366ac5833b3976974443f5bccc20c9b8417a08998426492cb1392.svg

Recombination and gene conversion at constant rates can be simulated alongside. In the following example recombinations at site 60 and 97 have occurred in addition to a gene conversion event covering the tract from site 76 to site 80.

ts = msprime.sim_ancestry(
    3, sequence_length = 100, recombination_rate=0.003,
    gene_conversion_rate=0.002, gene_conversion_tract_length=5,
    random_seed=6)
SVG(ts.draw_svg())
_images/13ee32edb33acbfd6e9764818c860f4699c7345a9619cb507dd98474ee670b90.svg

Variable recombination rates and constant gene conversion rates can also be combined.

Note

Variable rates of gene conversion breakpoint initiation are not currently supported, but are planned for a future release. If you are interested in this feature please let us know!

Discrete or continuous?#

By default, we assume that the genome we are simulating is discrete so that genome coordinates are at integer positions:

ts = msprime.sim_ancestry(
    3, recombination_rate=0.1, sequence_length=10, random_seed=2)
SVG(ts.draw_svg())
_images/f69d27abed270fb36dffa44791dcf9c7e7571da1d2708c05bb57affa5172f87a.svg

Alternatively, we can simulate a continous genome by setting discrete_genome=False:

ts = msprime.sim_ancestry(
    3, recombination_rate=0.25, sequence_length=1, discrete_genome=False,
    random_seed=33)
SVG(ts.draw_svg())
_images/788cd898c2563579023b291cedd69ab74ae578c66ba63e3b1a8d71eda966ad35.svg

Here we see that the breakpoints along the genome occur at floating point positions. Simulating a continuous genome sequence can be useful for theoretical work, but we recommend using discrete coordinates for most purposes.

Multiple chromosomes#

Do I need to do this?#

The main purpose of simulating multiple chromosomes together (instead of running independent simulations for each chromosome) is to capture correlations between trees on distinct chromosomes. In many scenarios, simulating multiple chromosomes may not be necessary, and independent simulations may be preferable as they will be much more efficient.

That being said, this section describes how to simulate data over multiple chromosomes, or more generally, over multiple regions with free recombination between them.

Using a fixed pedigree#

The most natural way to simulate multiple regions with free recombination for a set of sampled individuals is by simulating through a known pedigree. The fixed pedigree simulation model constrains simulated genomic regions by an input pedigree, which specifies parent-offspring relationships. Since autosomal genetic material in an offspring has equal chance of being inherited from either their mother or their father, independent simulations through the same pedigree ensures that those regions are inherited from different parents or from the same parent each with probability 1/2. This ensures free recombination between independent simulations.

We often do not have a full pedigree, but instead have a pedigree that extends some number of generations into the past. Because a recombination fraction of 1/2 between regions causes correlations between chromosomes to decay rapidly, even shallow pedigrees containing just a handful of generations should broadly capture expected levels of correlations. At the termination of the pedigree, the simulation can be completed using the DTWF, Hudson, or other model. See example 1 below for a small example of this procedure, and more details on completing pedigree simulations can be found in the Fixed pedigree section.

Why might simulating multiple chromosomes using a pedigree be preferable to other simulation options? First, there is no need to concatenate recombination maps for multiple chromosomes, as each region is simulated independently. This also results in simpler, reusable code which can be parallelized. Finally, this approach improves efficiency over a single larger simulation of multiple concatenated chromosomes, as computational costs scale as the square of the sequence length.

Simulations without a fixed pedigree#

Without a pedigree, we can use other simulation models in msprime such as DTWF and the multiple merger coalescent. These models do not directly support simulating multiple chromosomes simultaneously, but we can emulate it using a single linear genome split into chromosome segments. In this case, multiple chromosomes are modelled by specifying a recombination map with single base-pair segments with recombination probability 1/2 separating adjacent chromosomes. The probability that two chromosomes are co-inherited across \(t\) generations is \(2^{-t}\). In the Poisson model of recombination used by msprime, the probability of co-inheritance across \(t\) generations of two adjacent base pairs separated by a recombination rate \(r\) per base pair is \(e^{-rt}\). To make these agree, we set the recombination rate in the base pair segments separating chromosomes to \(\log(2)\).

Nelson et al. 2020 describes a few cases where simulating multiple chromosomes using the DTWF is important for accurately modeling shared patterns of variation between samples. In large simulated cohorts, where many samples can be close relatives, we need the DTWF model and multiple chromosomes to obtain the correct distribution of the number and total length of shared IBD segments. We similarly require multiple chromosomes and the DTWF model to get the correct ancestry patterns for very recently admixed populations.

Additionally, Multiple merger coalescents result in positively correlated ancestries between unlinked chromosomes (see Birkner et al. 2013). This correlation does not break down in time and multiple chromosomes should not be simulated independently under the multiple merger models. Unlinked chromosomes should scatter into distinct ancestors instantaneously under multiple merger coalescents, not with probability 1/2 per generation as in discrete population models. In msprime, this waiting time of length zero is approximated by an exponential distribution with rate \(r\), i.e. the recombination rate at the base pair separating the chromosomes, which should be set to a high enough value to obtain an acceptable approximation. We recommend considering in particular the relative error in comparison to the magnitudes other waiting times which are likely to arise in the simulation.

Important

Simulations of multiple chromosomes under either DTWF or the multiple merger models should use a discrete genome. While it is possible to set up such a simulation using a continuous genome, we strongly recommend against it, as chromosome divisions cannot be defined as discrete break points and the simulation will be very inefficient.

Example 1: simulating with a pedigree#

In this example, we simulate genomic data for a pair of first cousins in a small pedigree extending 3 generations into the past. See here for details on specifying, importing, and simulating using pedigrees.

Let’s first specify our pedigree:

import io
import tskit
ped_txt = """\
# id parent0 parent1 time is_sample
0   2   3   0.0 1
1   4   5   0.0 1
2   6   7   1.0 0
3   8   9   1.0 0
4   6   7   1.0 0
5   10  11  1.0 0
6   .   .   2.0 0
7   .   .   2.0 0
8   .   .   2.0 0
9   .   .   2.0 0
10  .   .   2.0 0
11  .   .   2.0 0
"""
pedigree = msprime.parse_pedigree(io.StringIO(ped_txt), sequence_length=100)

draw_pedigree(pedigree.tree_sequence())
_images/76e460bc1ad3b3fbb1e2d4ec32da76c97f586372c98bd002621b4c07c1db19e0.svg

Then suppose we would like to simulate three chromosomes of differing lengths and recombination rates. In practice, we might have recombination maps for each chromosome, but the general strategy remains the same.

Ls = [1000000, 2000000, 3000000]
rs = [1e-8, 2e-8, 3e-8]

ts_chroms = []
pedigree = msprime.parse_pedigree(io.StringIO(ped_txt), sequence_length=1)

for i, (L, r) in enumerate(zip(Ls, rs)):
    pedigree.sequence_length = L
    ped_ts = msprime.sim_ancestry(
        initial_state=pedigree, model="fixed_pedigree",
        recombination_rate=r, random_seed=i+1)
    ts_chroms.append(
        msprime.sim_ancestry(
            initial_state=ped_ts, population_size=1000,
            recombination_rate=r, model="dtwf", random_seed=i+100))

for i, ts in enumerate(ts_chroms):
    print(f"chromosome {i} has length {ts.sequence_length} and {ts.num_trees} trees")
chromosome 0 has length 1000000.0 and 48 trees
chromosome 1 has length 2000000.0 and 155 trees
chromosome 2 has length 3000000.0 and 377 trees

This results in three chromosomes of the desired lengths.

Example 2: simulating without a pedigree#

We first set up the chromosome coordinates, recombination rates, and the corresponding recombination map. The following defines a recombination map for three chromosomes each 1 cM in length:

import math

r_chrom = 1e-8
r_break = math.log(2)
chrom_positions = [0, 1e6, 2e6, 3e6]
map_positions = [
    chrom_positions[0],
    chrom_positions[1],
    chrom_positions[1] + 1,
    chrom_positions[2],
    chrom_positions[2] + 1,
    chrom_positions[3]
]
rates = [r_chrom, r_break, r_chrom, r_break, r_chrom]
rate_map = msprime.RateMap(position=map_positions, rate=rates)

We cannot use the default StandardCoalescent model to run simulations of multiple chromosomes; doing so would be equivalent to running independent replicate simulations and much less efficient. To get the correct correlation between chromosomes, we must use a model like discrete-time Wright-Fisher. However, because correlations between chromosomes break down very rapidly, we only need to simulate using dtwf for roughly 10-20 generations, after which we can switch to the hudson model to finish the simulation more efficiently (see the Specifying ancestry models section for more information on switching between ancestry models).

In this example, we simulate 10 sampled diploid individuals using the recombination map defined above for three chromosomes and switching from the dtwf to hudson model 20 generations ago:

ts = msprime.sim_ancestry(
    10,
    population_size=1000,
    recombination_rate=rate_map,
    model=[
        msprime.DiscreteTimeWrightFisher(duration=20),
        msprime.StandardCoalescent(),
    ],
    random_seed=1234,
)
ts
Tree Sequence
Trees351
Sequence Length3000000.0
Time Unitsgenerations
Sample Nodes20
Total Size60.3 KiB
MetadataNo Metadata
Table Rows Size Has Metadata
Edges 1254 39.2 KiB
Individuals 10 304 Bytes
Migrations 0 8 Bytes
Mutations 0 16 Bytes
Nodes 342 9.4 KiB
Populations 1 224 Bytes
Provenances 1 1.4 KiB
Sites 0 16 Bytes

To place each chromosome in its own separate tree sequence, we can trim the full tree sequence for each chromosome by defining the end points of each chromosome and using the keep_intervals() method. It is important to specify simplify=False so that node indices remain consistent across chromosomes. The trim() method is then used to trim the tree sequences to the focal chromosome, so that position indices begin at 0 within each chromosome tree sequence.

ts_chroms = []
for j in range(len(chrom_positions) - 1):
    start, end = chrom_positions[j: j + 2]
    chrom_ts = ts.keep_intervals([[start, end]], simplify=False).trim()
    ts_chroms.append(chrom_ts)
    print(chrom_ts.sequence_length)

# the third chromosome
chrom_ts
1000000.0
1000000.0
1000000.0
Tree Sequence
Trees102
Sequence Length1000000.0
Time Unitsgenerations
Sample Nodes20
Total Size26.7 KiB
MetadataNo Metadata
Table Rows Size Has Metadata
Edges 371 11.6 KiB
Individuals 10 304 Bytes
Migrations 0 8 Bytes
Mutations 0 16 Bytes
Nodes 342 9.4 KiB
Populations 1 224 Bytes
Provenances 3 2.3 KiB
Sites 0 16 Bytes

This gives us a list of tree sequences, one for each chromosome, in the order that they were stitched together in the initial recombination map.

Note

Many recombination maps inferred from data will contain regions at the start and end of the chromosome where recombination rates could not be inferred. These regions typically are assigned rate values of NaN. If we were to concatenate multiple recombination maps that have flanking NaN regions, the resulting maps would have regions with NaN rates in the interior of the map, which will produce an error when using such a map in simulations.

These regions typically represent unassay-able segments of the chromosome, and simulated genetic material in these regions should be discarded in downstream analyses. Thus, the suggested approach is to modify the recombination maps to set rates to zero in these flanking regions, concatenate the maps as shown in this example, and then remove those regions of the resulting tree sequences by calling ts.delete_intervals(nan_intervals), where nan_intervals is a list containing [start, end] values for each NaN flanking region.

Recording more information#

By default msprime stores the minimum amount of information required to represent the simulated genealogical history of the samples. Sometimes we are interested in more detailed information; this section gives details of options that allow us to do this.

Additional nodes#

In msprime we usually want to simulate the coalescent with recombination and represent the output as efficiently as possible. As a result, we don’t store individual recombination events, but rather their effects on the output tree sequence. We also do not explicitly store common ancestor events that do not result in marginal coalescences. For some purposes, however, we want record information on other events of interest, not just the mimimal representation of its outcome.

The additional_nodes and coalescing_segments_only options serve this exact purpose. These options allow us to record the nodes associated with a custom subset of all events we might observe in the history of a sample. Besides samples and coalescence events, nodes can now also represent common ancestor events, recombination, gene conversion, migration, and pass through events. The example below and the next few paragraphs provide a guide on how to record additional nodes and interpret the resulting node tables.

We first set up a simple pedigree simulation. Pedigrees are an interesting starting point as in contrast to the coalescent models a single node might be associated with more than one type of event. Conversely, for the continuous-time coalescent model simulated by default by msprime, each event is registered as a new node.

pb = msprime.PedigreeBuilder()
mp_21 = pb.add_individual(time=2)
dp_21 = pb.add_individual(time=2)
mp_11 = pb.add_individual(time=1, parents=[mp_21, dp_21])
dp_11 = pb.add_individual(time=1, parents=[mp_21, dp_21])
mp_12 = pb.add_individual(time=1, parents=[mp_21, dp_21])
dp_12 = pb.add_individual(time=1, parents=[mp_21, dp_21])
pb.add_individual(time=0, parents=[mp_11, dp_12], is_sample=True)
pb.add_individual(time=0, parents=[mp_11, dp_12], is_sample=True)
pb.add_individual(time=0, parents=[mp_12, dp_11], is_sample=True)
pedigree_tables = pb.finalise(sequence_length=5)
draw_pedigree(pedigree_tables.tree_sequence())
_images/b2e8ff838eaf328c120b0ee5d8ad8edf9b883528eac8eddf132eedc5099092e6.svg

To specify the particular node types we want to store information on, pass a NodeType object to the additional_nodes option of sim_ancestry(). This class extends enum.Flag. Each node type in the enumeration is associated with a numerical constant. Different node types can be combined and compared using bitwise operations. At the same time, these constant also function as the flags identifying the type of each node in the nodes table. Here, we take the bitwise union of three members of the enumeration: msprime.NodeType.RECOMBINANT, msprime.NodeType.PASS_THROUGH and msprime.NodeType.COMMON_ANCESTOR.

ts = msprime.sim_ancestry(
    initial_state=pedigree_tables,
    model='fixed_pedigree',
    recombination_rate=0.1,
    random_seed=45,
    additional_nodes=(
         msprime.NodeType.RECOMBINANT |
         msprime.NodeType.PASS_THROUGH | 
         msprime.NodeType.COMMON_ANCESTOR
         ),
    coalescing_segments_only=False
)

def count_flags(ts):
    counter = dict()
    for nodetype in msprime.NodeType:
        flags = np.bitwise_and(ts.nodes_flags, nodetype.value)
        counter[nodetype.name] = np.sum(flags > 0)

    return counter

print(count_flags(ts))
{'RECOMBINANT': 4, 'COMMON_ANCESTOR': 0, 'MIGRANT': 0, 'CENSUS': 0, 'GENE_CONVERSION': 0, 'PASS_THROUGH': 7}

The count_flags function demonstrates the use of bitwise operations to determine the type of a node. This function iterates across all node types and checks for all rows in the nodes table whether the intersection of the basic node type and the node within the nodes table is different from 0. It returns a dict containing the counts of all possible node types.

Note that recombination events are associated with two nodes, one for both ends that are created by recombination backwards in time. The number of recombination events is thus half the number of recombination nodes. PASS_THROUGH events occur when the ancestral material of only a single lineage passes through the genome of an ancestor, making coalescence impossible. This event can only be observed in models that track individuals: msprime.DiscreteTimeWrightFisher and msprime.FixedPedigree models.

node_style_map = {
    msprime.NodeType.RECOMBINANT.value: 'yellow',
    msprime.NodeType.COMMON_ANCESTOR.value: 'red',
    msprime.NodeType.PASS_THROUGH.value: 'blue',
    (msprime.NodeType.RECOMBINANT | msprime.NodeType.PASS_THROUGH).value : 'green',
    (msprime.NodeType.RECOMBINANT | msprime.NodeType.COMMON_ANCESTOR).value : 'orange'
}

css_string = ""
for node in ts.nodes():
    if node.flags in node_style_map:
        css_string += ".n%s > .sym {fill: %s}" % (node.id, node_style_map[node.flags])
wide_fmt = (800, 250)
ts.draw_svg(style=css_string, size=wide_fmt)
_images/43a71e8958273618c96963c07b05d847ea8e6ab4f6aed060f4ef215516f65cdb.svg

Similarly, the bitwise operations logic can be used to color the different nodes in the tree sequence according to their NodeType. Node 5 is both msprime.NodeType.RECOMBINANT and msprime.NodeType.PASS_THROUGH (green). Nodes 0, 1, 4 are recombinant nodes (yellow). All other nodes are only associated with a PASS_THROUGH event (blue). Note that we do not observe any COMMON_ANCESTOR events. This means that all yellow nodes are also associated with a coalesence event (see common ancestor events). Note that in order to verify whether a node is RECOMBINANT and PASS_THROUGH, we use the bitwise OR to define its associated NodeType constant.

To summarize: additional_nodes are specified using bitwise flags. Multiple basic node types can be combined by taking the bitwise OR (|) and then passed to the additional_nodes option. This enables us to track the specified nodes across the genealogical history of the sample. By means of bitwise AND (&) we can then query the nodes table and check the type of each of the recorded nodes.

If we wish to reduce these trees down to the minimal representation, we can use tskit.TreeSequence.simplify(). The resulting tree sequence will have all of these unary nodes removed and will be equivalent to (but not identical, due to stochastic effects) calling sim_ancestry() without the additional_nodes or coalescing_segment_only argument(s).

Coalescing segments only#

Note that the coalescing_segments_only option should be set to False when recording additional nodes. Setting coalescing_segments_only to False allows us to switch off the default simulator behaviour of only recording the relationships between overlapping ancestry segments. Instead, you can now record edges along the full extent of any of the ancestral lineages that was involved in any of the events as specified by the additional_nodes flag. This option can also be set to False without specifying any additional nodes. When coalescing_segments_only is set to False, edges that are a result of a coalescent event will record the full length of the tracked ancestral material, not just the overlapping segments of genome (as is the default behavior. As a result, this option will produce in nodes with only one child (unary nodes) along parts (unary regions) of the genome.

For instance: suppose an ancestral segment ancestral to node m spans from a to b along the genome, and another overlapping segment ancestral to node n spans from c to d, with a < c < b < d. If the two coalesce to create node p, if coalescing_segments_only=True (the default) then edges from m and n to p over the genomic segment [c,b) would be created, and the ancestry of the remaining segments (i.e., [a,c) for m and [b,d) for n) would be left unspecified until further coalescent events. However, if coalescing_segments_only=False, then the edges recorded would be from m to p over the entire segment [a, b), and from n to p over the entire segment [c,d). The nodes m and n coalesce (in p) on only the overlapping segment [c,b), and so the node p will be a unary node in the flanking regions: above m on the segment [a,c) and above n on the segment [b,d).

Common ancestor events#

We distinguish two types of common ancestor events: coalescence and common ancestor (in the strict sense) events.

Coalescence events are the default node type (associated with value 0) and are therefore only implicitly part of the NodeType enumeration. In contrast, whenever two nonoverlapping ancestral segments are found to have inherited from the same ancestral genome, there is no coalescence event, and so we register this in the node table as msprime.NodeType.COMMON_ANCESTOR. Finally, when keeping track of individuals (msprime.DiscreteTimeWrightFisher, msprime.FixedPedigree), we might observe genomes (ploids) through which the ancestral material of only a single lineage passes. Such genomes can be registered in the table as msprime.NodeType.PASS_THROUGH nodes. Although this is obviously not a common ancestor event, note that for both of these models each node has to carry at least one of these three flags.

Recombination events#

Additional nodes can also be used to mark recombination events (cross over) as well as gene conversion. A seperate msprime.NodeType exists for each: msprime.NodeType.RECOMBINANT and msprime.NodeType.GENE_CONVERSION.

A recombination event results in two extra nodes being recorded, one identifying the genome providing the genetic material to the left of the breakpoint and the other identifying the genome providing the genetic material to the right. For more information on how to set the recombination rate or specify a recombination map: see Recombination. More information on how to set up a simulation with gene conversion can be found here: Gene conversion.

Migration events#

When running simulations with demography, by default, migration events are not recorded in the resulting tree sequence. While the population associated with each node is determined by the simulated migration events, knowledge of the precise sequence of migrations is useful in a number of situations: for example see the tskit tutorial on introgression.

The record_migrations parameter allows us to record these events in the simulated tree sequence outputted by sim_ancestry(). Briefly, a migration record encodes the node associated with the migration, the time of the event, the source and destination populations (backwards in time), and the left and right coordinates of the migrating segement. For more details on migration records, see the migration table definition.

Here, we provide a simple example of the effect of setting record_migrations=True using the stepping stone model where migration is permitted between adjacent populations. Additionally, we set additional_nodes=msprime.NodeType.MIGRANT (see previous section) to record nodes corresponding to migration events. This is not necessary, but will be helpful to visualise the result.

N = 10
demography = msprime.Demography.stepping_stone_model(
    [20] * N, migration_rate=0.1, boundaries=True
)
ts = msprime.sim_ancestry(
    {0: 1, N - 1: 1},
    demography=demography,
    record_migrations=True,
    additional_nodes=msprime.NodeType.MIGRANT,
    coalescing_segments_only=False,
    random_seed=6,
)

In the following visualisation, we trace the migration history of two samples backwards in time to their most recent common ancestor. Each one of the ten populations in the model is assigned a colour and a position on the x axis. The time of migration events is recorded on the y axis.

import tskit
import numpy as np
import matplotlib

cmap = matplotlib.cm.get_cmap('viridis')
tree = ts.first()
mig_color_dict = {}
for u, linestyle in zip(ts.samples()[::2], ["dotted", "dashdot"]):
    mig = ts.tables.migrations
    loc = []
    time = []
    while u != tskit.NULL:
        migs = np.where(mig.node == u)[0]
        for cur_mig in migs:
            cur_mig = mig[cur_mig]
            loc.append(cur_mig.dest + 0.5)
            time.append(cur_mig.time)
        u = tree.parent(u)
    plt.plot(loc, time, linestyle=linestyle, color="black")
for i in range(N):
    rgba = cmap(i / N)
    mig_color_dict[i] = np.array(rgba)*255
    plt.axvspan(i, i+1, facecolor=rgba, alpha=0.5)
plt.ylabel("Time of migration event")
plt.xlabel("Population")
plt.show()
/tmp/ipykernel_2580/2194563639.py:5: MatplotlibDeprecationWarning: The get_cmap function was deprecated in Matplotlib 3.7 and will be removed two minor releases later. Use ``matplotlib.colormaps[name]`` or ``matplotlib.colormaps.get_cmap(obj)`` instead.
  cmap = matplotlib.cm.get_cmap('viridis')
_images/b56620aa65e40c097f11ed11dc1d1afd86d4c8496bcecf95a5732e1b36f0079d.svg

Note also how many migration events occurred before the two samples coalesced, even in this small example.

The next visualisation shows where each of these migrations fall on the edges of the simulated tree. We show the nodes associated with each migration event, colouring nodes and edges by the population with which they are associated (using the same colour scheme as the previous visualisation).

mig_color_dict = {}
cmap = matplotlib.cm.get_cmap("viridis")
for i in range(N):
    rgba = cmap(i / N)
    mig_color_dict[i] = np.array(rgba) * 255

edge_colors = {}
string = ".tree .edge {stroke-width:4px}"
pop = ts.tables.nodes.population
for node in ts.nodes():
    node = node.id
    edge_colors[node] = mig_color_dict[pop[node]]
    string += (
        ".node.n"
        + str(node)
        + "> .sym {fill: rgb"
        + str(tuple(mig_color_dict[pop[node]][0:3]))
        + "}"
    )
    string += (
        ".node.n"
        + str(node)
        + " > .edge {stroke: rgb"
        + str(tuple(mig_color_dict[pop[node]][0:3]))
        + "}"
    )
for mig in ts.migrations():
    edge_colors[mig.node] = mig_color_dict[mig.source]
    string += (
        ".node.n"
        + str(mig.node)
        + "> .sym {fill: rgb"
        + str(tuple(mig_color_dict[pop[mig.node]][0:3]))
        + "}"
    )
    string += (
        ".node.n"
        + str(mig.node)
        + " > .edge {stroke: rgb"
        + str(tuple(mig_color_dict[mig.dest][0:3]))
        + "}"
    )
node_labels = {node.id: "" for node in ts.nodes()}
SVG(
    ts.first().draw_svg(
        size=(500, 400), style=string, node_labels=node_labels, symbol_size=10
    )
)
/tmp/ipykernel_2580/1098007938.py:2: MatplotlibDeprecationWarning: The get_cmap function was deprecated in Matplotlib 3.7 and will be removed two minor releases later. Use ``matplotlib.colormaps[name]`` or ``matplotlib.colormaps.get_cmap(obj)`` instead.
  cmap = matplotlib.cm.get_cmap("viridis")
_images/06fcc6eb93c8b89c4357013906e3b15e1eb57ba8f9bafbe5d887e4bac37aee14.svg

Census events#

Census events allow you to add a node to each branch of the tree sequence at a given time during the simulation. This can be useful when you wish to study haplotypes that are ancestral to your simulated sample, or when you wish to know which lineages were present in which populations at specified times.

For instance, the following example creates a two population island model Demography, with the populations exchanging migrants at rate 0.05. At generation 5000, we add a census event using the Demography.add_census() method to determine where each of the lineages is at that time:

demography = msprime.Demography.island_model([1000, 1000], migration_rate=0.05)
demography.add_census(time=5000)
ts = msprime.sim_ancestry(
    {0: 1, 1: 1},
    demography=demography,
    recombination_rate=1e-7,
    sequence_length=1000,
    random_seed=141
)
css_string = ""
for node in ts.nodes():
    if node.time == 5000:
        css_string += ".n%s > .sym {fill: %s}" % (node.id, 'green')
SVG(ts.draw_svg(style=css_string))
_images/9334ead1a084ecf7d215a95f3f13071ceb6d8dc8b23d0b8ce8cd6ce5913304f2.svg

The resulting tree sequence has nodes on each tree at the specified census time. These are the nodes with IDs 8, 9, 10, 11, 12 and 13.

This tells us that the genetic material ancestral to the present day sample was held within 5 haplotypes at time 5000. The node table shows us that four of these haplotypes (nodes 8, 9, 10 and 11) were in population 0 at this time, and two of these haplotypes (nodes 12 and 13) were in population 1 at this time.

print(ts.tables.nodes)
census_nodes = np.bitwise_and(ts.nodes_flags, msprime.NodeType.CENSUS.value) > 0  
print(ts.tables.nodes[census_nodes])
╔══╤═══════╤══════════╤══════════╤═════════════╤════════╗
║id│flags  │population│individual│time         │metadata║
╠══╪═══════╪══════════╪══════════╪═════════════╪════════╣
║0 │      1│         0│         0│   0.00000000│        ║
║1 │      1│         0│         0│   0.00000000│        ║
║2 │      1│         1│         1│   0.00000000│        ║
║3 │      1│         1│         1│   0.00000000│        ║
║4 │      0│         1│        -1│2331.91109774│        ║
║5 │      0│         1│        -1│3741.02811877│        ║
║6 │      0│         0│        -1│4216.80416680│        ║
║7 │      0│         1│        -1│4580.66488183│        ║
║8 │1048576│         0│        -1│5000.00000000│        ║
║9 │1048576│         0│        -1│5000.00000000│        ║
║10│1048576│         0│        -1│5000.00000000│        ║
║11│1048576│         0│        -1│5000.00000000│        ║
║12│1048576│         1│        -1│5000.00000000│        ║
║13│1048576│         1│        -1│5000.00000000│        ║
║14│      0│         1│        -1│5232.65267392│        ║
║15│      0│         0│        -1│8192.48105714│        ║
╚══╧═══════╧══════════╧══════════╧═════════════╧════════╝

╔══╤═══════╤══════════╤══════════╤════╤════════╗
║id│flags  │population│individual│time│metadata║
╠══╪═══════╪══════════╪══════════╪════╪════════╣
║0 │1048576│         0│        -1│5000│        ║
║1 │1048576│         0│        -1│5000│        ║
║2 │1048576│         0│        -1│5000│        ║
║3 │1048576│         0│        -1│5000│        ║
║4 │1048576│         1│        -1│5000│        ║
║5 │1048576│         1│        -1│5000│        ║
╚══╧═══════╧══════════╧══════════╧════╧════════╝

If we wish to study these ancestral haplotypes further, we can simplify the tree sequence with respect to the census nodes and perform subsequent analyses on this simplified tree sequence. In this example, ts_anc is a tree sequence obtained from the original tree sequence ts by labelling the census nodes as samples and removing all nodes and edges that are not ancestral to these census nodes.

nodes = np.where(census_nodes)[0]
print(nodes)
ts_anc = ts.simplify(samples=nodes)
[ 8  9 10 11 12 13]

Using the DTWF model#

Census events assume that they are being used as part of a continuous time simulation process, in which multiple events are guaranteed not to happen at the same time. However, the DiscreteTimeWrightFisher model does not meet these assumptions, and allows multiple events (such as coalescences, migrations etc) to all occur at the same time. This includes census events, which can lead to errors when used in conjunction with the DTWF model.

Consider the following example:

demography = msprime.Demography()
demography.add_population(name="A", initial_size=10)
ts = msprime.sim_ancestry({"A": 3}, demography=demography, model="dtwf", random_seed=33)
SVG(ts.draw_svg(y_axis=True, time_scale="log_time"))
_images/4dfa5f19369876df91ccdd4094618b4d9eb9d8ab36d4fea30a0047e35beab8cb.svg

We can see that a coalescence happens at time 3, giving node 7. Now, let’s try to perform a census at time 3:

demography = msprime.Demography()
demography.add_population(name="A", initial_size=10)
demography.add_census(time=3)
ts = msprime.sim_ancestry({"A": 3}, demography=demography, model="dtwf", random_seed=33)
---------------------------------------------------------------------------
LibraryError                              Traceback (most recent call last)
Cell In[47], line 4
      2 demography.add_population(name="A", initial_size=10)
      3 demography.add_census(time=3)
----> 4 ts = msprime.sim_ancestry({"A": 3}, demography=demography, model="dtwf", random_seed=33)

File ~/work/tskit-site/tskit-site/msprime/ancestry.py:1311, in sim_ancestry(samples, demography, sequence_length, discrete_genome, recombination_rate, gene_conversion_rate, gene_conversion_tract_length, population_size, ploidy, model, initial_state, start_time, end_time, record_migrations, record_full_arg, additional_nodes, coalescing_segments_only, num_labels, random_seed, num_replicates, replicate_index, record_provenance)
   1289     provenance_dict = provenance.get_provenance_dict(parameters)
   1290 sim = _parse_sim_ancestry(
   1291     samples=samples,
   1292     sequence_length=sequence_length,
   (...)
   1309     random_seed=random_seed,
   1310 )
-> 1311 return _wrap_replicates(
   1312     sim,
   1313     num_replicates=num_replicates,
   1314     replicate_index=replicate_index,
   1315     provenance_dict=provenance_dict,
   1316 )

File ~/work/tskit-site/tskit-site/msprime/ancestry.py:681, in _wrap_replicates(simulator, num_replicates, replicate_index, provenance_dict, mutation_rate)
    675 iterator = simulator.run_replicates(
    676     num_replicates,
    677     mutation_rate=mutation_rate,
    678     provenance_dict=provenance_dict,
    679 )
    680 if replicate_index is not None:
--> 681     deque = collections.deque(iterator, maxlen=1)
    682     return deque.pop()
    683 else:

File ~/work/tskit-site/tskit-site/msprime/ancestry.py:1606, in Simulator.run_replicates(self, num_replicates, mutation_rate, provenance_dict)
   1604 for replicate_index in range(num_replicates):
   1605     logger.info("Starting replicate %d", replicate_index)
-> 1606     self.run()
   1607     if mutation_rate is not None:
   1608         # This is only called from simulate() or the ms interface,
   1609         # so does not need any further parameters.
   1610         mutations._simple_mutate(
   1611             tables=self.tables,
   1612             random_generator=self.random_generator,
   1613             sequence_length=self.sequence_length,
   1614             rate=mutation_rate,
   1615         )

File ~/work/tskit-site/tskit-site/msprime/ancestry.py:1571, in Simulator.run(self, event_chunk, debug_func)
   1569     raise ValueError("Model durations must be >= 0")
   1570 end_time = min(self.time + model_duration, self.end_time)
-> 1571 exit_reason = self._run_until(end_time, event_chunk, debug_func)
   1572 if exit_reason == ExitReason.COALESCENCE or self.time == self.end_time:
   1573     logger.debug("Skipping remaining %d models", len(self.models) - j - 1)

File ~/work/tskit-site/tskit-site/msprime/ancestry.py:1536, in Simulator._run_until(self, end_time, event_chunk, debug_func)
   1534 ret = ExitReason.MAX_EVENTS
   1535 while ret == ExitReason.MAX_EVENTS:
-> 1536     ret = ExitReason(super().run(end_time, event_chunk))
   1537     if self.time > end_time:
   1538         # Currently the Pedigree and Sweeps models are "non-reentrant"
   1539         # We can change this to an assertion once these have been fixed.
   1540         raise RuntimeError(
   1541             f"Model {self.model['name']} does not support interruption. "
   1542             "Please open an issue on GitHub"
   1543         )

LibraryError: The simulation model supplied resulted in a parent node having a time value <= to its child. This can occur as a result of multiple bottlenecks happening at the same time, multiple census events at the same time or numerical imprecision with very smallpopulation sizes.

We got an error from msprime about a parent node having a time value <= to its child, which occurred because we tried to add set of census nodes at time 3, but node 7 was already at time 3, giving a zero branch length (which is disallowed by tskit).

The solution is to use non-integer time values when performing a census under the DTWF:

demography = msprime.Demography()
demography.add_population(name="A", initial_size=10)
demography.add_census(time=3.5)
ts = msprime.sim_ancestry({"A": 3}, demography=demography, model="dtwf", random_seed=33)
SVG(ts.draw_svg(y_axis=True, time_scale="log_time"))
_images/743025a50bda06241781b0002622a2ba0c56bbf6218c4e9b3fa80d2cdc27a12b.svg

Ancestral recombination graph#

This is a legacy option and identical to setting the additional_nodes option to store common_ancestor events, recombinants, (and migrants if applicable). This is illustrated in the following example:

ts = msprime.sim_ancestry(
    3, recombination_rate=0.1, sequence_length=2,
    record_full_arg=True, random_seed=42)
additional_nodes = msprime.NodeType.COMMON_ANCESTOR | msprime.NodeType.RECOMBINANT
ts_other = msprime.sim_ancestry(
    3, recombination_rate=0.1, sequence_length=2,
    additional_nodes=additional_nodes,
    coalescing_segments_only=False,
    random_seed=42)
print(ts.tables.equals(ts_other.tables, ignore_provenance=True))
print(ts.tables.nodes)
SVG(ts.draw_svg())
True
╔══╤══════╤══════════╤══════════╤══════════╤════════╗
║id│flags │population│individual│time      │metadata║
╠══╪══════╪══════════╪══════════╪══════════╪════════╣
║0 │     1│         0│         0│0.00000000│        ║
║1 │     1│         0│         0│0.00000000│        ║
║2 │     1│         0│         1│0.00000000│        ║
║3 │     1│         0│         1│0.00000000│        ║
║4 │     1│         0│         2│0.00000000│        ║
║5 │     1│         0│         2│0.00000000│        ║
║6 │     0│         0│        -1│0.21230674│        ║
║7 │     0│         0│        -1│0.51485157│        ║
║8 │     0│         0│        -1│0.71161449│        ║
║9 │131072│         0│        -1│0.91107705│        ║
║10│131072│         0│        -1│0.91107705│        ║
║11│262144│         0│        -1│1.21743776│        ║
║12│     0│         0│        -1│1.23130397│        ║
║13│     0│         0│        -1│4.80416300│        ║
╚══╧══════╧══════════╧══════════╧══════════╧════════╝
_images/2a6bc0f3f39d8fccdb7c46270f3ce2ce6d441ffc52e2a92b58588206de572fb1.svg

After running the simulation we first print out the node table, which contains information on all the nodes in the tree sequence. Note that flags column contains several different values: all of the sample nodes (at time 0) have a flag value of 1 (tskit.NODE_IS_SAMPLE). Most other nodes have a flag value of 0, which is the standard for internal nodes in a coalescent simulations.

Nodes 9 and 10 have flags equal to 131072 (NodeType.RECOMBINANT), which tells us that they correspond to a recombination event in the ARG. A recombination event results in two extra nodes being recorded, one identifying the individual providing the genetic material to the left of the breakpoint and the other identifying the individuals providing the genetic material to the right. The effect of this extra node can be seen in the trees: node 9 is present as a ‘unary’ node in the left hand tree and node 10 in the right.

In this particular case, the first thing that happens to these two lineages as we go up the tree sequence is that they coalesce straight back together again at node 11, forming a (normally undetectable) “diamond event” in the ARG, and explaining why the topology of both these trees appears the same. In a full ARG, this is one of many ways that a coalescent event can occur without resulting in a marginal coalescence, and which instead results in an additional unary node in the trees. Such nodes are given a flags value of 262144 (NodeType.COMMON_ANCESTOR).

If we wish to reduce these trees down to the minimal representation, we can use tskit.TreeSequence.simplify(). The resulting tree sequence will have all of these unary nodes removed and will be equivalent to (but not identical, due to stochastic effects) calling sim_ancestry() without the record_full_arg argument.

Migrations nodes are also recorded in the ARG using the NodeType.MIGRANT flag. See the Node flags section for more details.

Manipulating simulation time#

Stopping simulations early#

In most simulations we want to simulate a complete history for our samples: that is, to when we have a common ancestor at every point along the sequence. For example, here we generate a complete ancestral history for two diploid samples:

ts = msprime.sim_ancestry(2, sequence_length=10, recombination_rate=0.1, random_seed=42)
SVG(ts.draw_svg(y_axis=True, y_ticks=[0, 1, 2, 3, 4, 5]))
_images/3a3ca8f5bc0fc7f88afca76ada5a80fafc477749103217b4fe30086cea588250.svg

Sometimes we would like to stop the simulation early, before complete coalescence has occurred; perhaps we would like to combine several simulations with different properties, or we are really only interested in the recent past and simulating more ancient processes would be a waste of computational resources.

We can do this by specifying a end_time value. Let’s repeat the simulation as above with an end_time of 2:

ts = msprime.sim_ancestry(
    2, sequence_length=10, recombination_rate=0.1, end_time=2, random_seed=42)
SVG(ts.draw_svg(y_axis=True, y_ticks=[0, 1, 2]))
_images/c9b6d1230dac330d151705fe1cb64c1eb94798126d41db355a05aa6815723bbf.svg

There are a number of important things to observe about this tree sequence:

  1. The history up until time 2 is identical to the original simulation.

  2. The first tree has not fully coalesced, and we therefore have multiple roots. Trees with multiple roots are fully supported in tskit.

  3. The nodes 8 and 9 are at time 2 (our end_time value), and these connect to the uncoalesced portions of the tree. These “unary” nodes are very important for ensuring the correct statistical properties when we combine multiple simulations.

See also

The ancestry model duration can also be used to stop simulations before coalescence.

For discrete time models, the exact interpretation of the end_time is important — we continue to simulate the process while the time is <= end_time. For example here we use an end_time of 1 generation in a Discrete Time Wright-Fisher simulation:

ts = msprime.sim_ancestry(3, population_size=10, model="dtwf", end_time=1, random_seed=42)
SVG(ts.draw_svg(y_axis=True))
_images/0fc20412839f693c52b075b4769b3fb8bce1c39d7fff232ebe050a716a3e8611.svg

We have run 1 generation of the DTWF, resulting in the coalescence at node 9.

Setting the start time#

Important

Setting simulation start time is an advanced feature which is rarely needed in practice.

Sometimes we need control when the simulation process starts. For example, given the initial configuration of at time zero we want to “jump ahead” to a later time point before we start generating random events. Consider the following example:

ts = msprime.sim_ancestry(2, random_seed=42)
ts.tables.nodes
idflagspopulationindividualtimemetadata
01000.00000000
11000.00000000
21010.00000000
31010.00000000
400-10.15642269
500-10.29152170
600-12.11740681

Now we run the same simulation with a start_time of 10:

ts = msprime.sim_ancestry(2, start_time=10, random_seed=42)
ts.tables.nodes
idflagspopulationindividualtimemetadata
01000.00000000
11000.00000000
21010.00000000
31010.00000000
400-110.15642269
500-110.29152170
600-112.11740681

We can see that the results are identical except that 10 has been added to all of the internal node times.

Important

It is important to note that the start_time only controls the “random” events that happen within the simulation model: events that are specified to occur at a given time (such as sampling or demographic events) will occur at that time, even if this is before the start_time.

Consider the following example in which we create a simple population tree demography, and then take samples at different times all before start_time:

demography = msprime.Demography()
demography.add_population(name="A", initial_size=100)
demography.add_population(name="B", initial_size=100)
demography.add_population(name="C", initial_size=100)
demography.add_population_split(time=5, derived=["A", "B"], ancestral="C")

ts = msprime.sim_ancestry(
    samples=[
        msprime.SampleSet(1, population="A", time=1),
        msprime.SampleSet(1, population="B", time=2),
        msprime.SampleSet(1, population="C", time=6),
    ],
    demography=demography,
    ploidy=1,
    start_time=10,
    random_seed=1)

ts.tables.nodes
idflagspopulationindividualtimemetadata
01001.00000000
11112.00000000
21226.00000000
302-152.47084127
402-166.18169933

We can see that the samples we specify are in the correct populations and their node times are at the sampling time. Note that all the coalescences happen in population ID 2 (“C”), indicating that the lineages from “A” and “B” moved there at the time of the split (which we can observe directly using the record_migrations option if we wish). Note also that we drew a sample from population “C” at time 6, which is after the split; if we tried to draw a sample before the split, we would get an error because the population would be inactive.

See also

See the Order of event execution section for details of the precise order in which fixed-time events occur.

Note

The same considerations apply when we are running simulations based on a given initial state; in this case, the start_time is by default set to the time of the youngest root in the input trees.

Order of event execution#

There are two different classes of event that happen in ancestry simulations: “fixed” and “random” events. Random events are generated by the ancestry model, and fixed events are those that occur at a time specified by the user. Fixed events are demographic events or sampling events. If multiple events are specified to occur at exactly the same time, then:

  1. All demographic events at this time are run, in the order they were specified.

  2. Sampling events occur immediately after all the demographic events have been executed. (See the Sampling time section for more details on ancient sampling events.)

These details are slightly different for the Discrete Time Wright-Fisher model. See the API documentation for details.

Specifying the initial state#

By default msprime simulations are initialised by specifying a set of samples, which sets up the simulation with segments of ancestral material covering the whole sequence. Internally, this initial state is implemented by setting up a tskit.TableCollection which contains the population, node, and individual information. The low-level simulation code then reads these tables and sets up the simulation accordingly. We can see what this initial state looks like by setting up a simple simulation and stopping it immediately:

initial_ts = msprime.sim_ancestry(1, end_time=0)
initial_ts
Tree Sequence
Trees1
Sequence Length1.0
Time Unitsgenerations
Sample Nodes2
Total Size1.4 KiB
MetadataNo Metadata
Table Rows Size Has Metadata
Edges 0 8 Bytes
Individuals 1 52 Bytes
Migrations 0 8 Bytes
Mutations 0 16 Bytes
Nodes 2 64 Bytes
Populations 1 224 Bytes
Provenances 1 1015 Bytes
Sites 0 16 Bytes

We can see that the returned tree sequence has one population, two sample nodes and one individual (since sample individuals are diploid by default). There are no edges because the simulation didn’t run for any period of time: we’re essentially just returning the initial state used to start the simulation.

Then, running a simulation in which the initial_state is equal to initial_ts is precisely the same as running the simulation with the same arguments as we used above:

ts1 = msprime.sim_ancestry(initial_state=initial_ts, population_size=100, random_seed=2)
ts2 = msprime.sim_ancestry(1, population_size=100, random_seed=2)
assert ts1.equals(ts2, ignore_provenance=True)

Warning

Note that the initial_state tree sequence only contains the information about the ancestral histories, and not any of the simulation parameters. In the previous example we used default parameters to keep things simple, but in general the full simulation model will need to be specified as well as the initial state. In particular, see the Interaction with demography section.

While it is possible to specify the initial state manually by constructing the input tree sequence, it is an advanced topic and rarely necessary. More often we are interested in specifying the initial state of an msprime simulation using the output of another simulation.

Note

The initial_state tree sequence is quite literally the initial state of the tree sequence that is returned at the end of a simulation, and that any information present (including metadata) will not be altered.

Combining backward-time simulations#

Suppose that we wished to simulate a scenario in which we have recombination happening at some rate in the recent past, and in the more distant past there is no recombination. We can do this by combining the simulations, using one as the initial_state for the other.

We first run the simulation of the recent past:

ts1 = msprime.sim_ancestry(
    2, recombination_rate=0.01, sequence_length=10, end_time=10,
    population_size=100, random_seed=1)
SVG(ts1.draw_svg(y_axis=True, time_scale="log_time", y_ticks=[0, 1, 2, 5, 10]))
_images/4d0ac5d95b21d9a759b4853b85ecf2f5af7b8afd0d086184a4fdd046458ea661.svg

Some recombination and coalescence has happened and we have three marginal trees along the genome. None of these trees has fully coalesced, and so each of the trees has multiple roots.

Important

The set of nodes at time 10 connecting to all the internal nodes in the trees are very important: without these nodes and unary edges, the statistical properties of the combined simulations would not be correct! (See also section below on forward simulations.)

We then run the next phase of the simulation in which there is no recombination by passing ts1 as the argument to initial_state:

ts2 = msprime.sim_ancestry(initial_state=ts1, population_size=100, random_seed=3)
SVG(ts2.draw_svg(y_axis=True, time_scale="log_time", y_ticks=[0, 10, 20, 50, 100, 200, 500]))
_images/26e54ba45161d29820cc774283e47072cd9bd53f64b856bccbc1c9a571ccb10c.svg

Since there is no recombination in this more ancient simulation, node 12 is the MRCA in all three trees. The unary nodes 6, 7, 8, and 9 marking the transition between the two simulation regimes are still present in the trees. If you wish to remove these, we can use the tskit.TreeSequence.simplify() method:

ts_simplified = ts2.simplify()
SVG(ts_simplified.draw_svg())
_images/3301261e3dc90899f82a3489c7615e6bf57e1d36e5865cfe1445c7eaa84f3488.svg

Interaction with demography#

In the previous example we saw how to combine two single population simulations. We can also combine more complex simulations involving demography, but care needs to be taken to ensure that the correct models are simulated in each part of the simulation. Suppose we have simple model with a population split:

demography = msprime.Demography()
demography.add_population(name="A", description="Contemporary population A", initial_size=100)
demography.add_population(name="B", description="Contemporary population B", initial_size=200)
demography.add_population(name="C", description="Ancestral population", initial_size=300)
demography.add_population_split(time=100, ancestral="C", derived=["A", "B"])
demography
Populations (3)
idnamedescriptioninitial_sizegrowth_ratedefault_sampling_timeextra_metadata
0AContemporary population A100.000{}
1BContemporary population B200.000{}
2CAncestral population300.001e+02{}
Migration matrix (all zero)
Events (1)
timetypeparameterseffect
100Population Splitderived=[A, B], ancestral=CMoves all lineages from derived populations 'A' and 'B' to the ancestral 'C' population. Also set the derived populations to inactive, and all migration rates to and from the derived populations to zero.

Then, we run a simulation of this model for 50 generations:

ts1 = msprime.sim_ancestry(
    samples={"A": 1, "B": 1}, demography=demography, end_time=50, random_seed=1234)
node_labels = {
    node.id: f"{node.id}:{ts1.population(node.population).metadata['name']}"
    for node in ts1.nodes()}
SVG(ts1.draw_svg(node_labels=node_labels, y_axis=True))
_images/86f4798c35fd20d41285b6e1285e39704df98d9a374c84ec25daf47710b0b052.svg

After 50 generations we can see there has been a coalescence in population A but none in B and we therefore have three lineages left when the simulation finishes. We can then continue the simulation based on this initial state:

ts2 = msprime.sim_ancestry(
    initial_state=ts1, demography=demography, random_seed=5678)
node_labels = {
    node.id: f"{node.id}:{ts2.population(node.population).metadata['name']}"
    for node in ts2.nodes()}
SVG(ts2.draw_svg(node_labels=node_labels, y_axis=True))
_images/dbd12c1c430e5aa4629cedd42e15073f72aff886527ada0b7ad970cb350fc4e5.svg

Note that we use the same demography object, and so the lineages migrate and ultimately coalesce in population C, as we’d expect. This is the simplest case, in which we already have the demography object which we can use to model the entire simulation from end-to-end.

In other cases (e.g. when working with simulations from a different program), we don’t have the Demography object at hand and we need to create one that is both compatible with the initial_state tree sequence and reflects the demographic model that we are interested in using. The best way to do this is to use the Demography.from_tree_sequence() method to first get a base demography that is based on the tree sequence population table and metadata:

demography = msprime.Demography.from_tree_sequence(ts1)
demography
Populations (3)
idnamedescriptioninitial_sizegrowth_ratedefault_sampling_timeextra_metadata
0AContemporary population A0.000{}
1BContemporary population B0.000{}
2CAncestral population0.000{}
Migration matrix (all zero)
Events (0)

We can see that the population names and descriptions have been recovered from the tree sequence metadata but no other information: the population sizes are all zero and we no longer have a population split event. If we try to run a simulation using this demography we get an error:

ts = msprime.sim_ancestry(initial_state=ts1, demography=demography)
---------------------------------------------------------------------------
InputError                                Traceback (most recent call last)
Cell In[65], line 1
----> 1 ts = msprime.sim_ancestry(initial_state=ts1, demography=demography)

File ~/work/tskit-site/tskit-site/msprime/ancestry.py:1290, in sim_ancestry(samples, demography, sequence_length, discrete_genome, recombination_rate, gene_conversion_rate, gene_conversion_tract_length, population_size, ploidy, model, initial_state, start_time, end_time, record_migrations, record_full_arg, additional_nodes, coalescing_segments_only, num_labels, random_seed, num_replicates, replicate_index, record_provenance)
   1265     parameters = dict(
   1266         command="sim_ancestry",
   1267         samples=samples,
   (...)
   1287         # replicate index is excluded as it is inserted for each replicate
   1288     )
   1289     provenance_dict = provenance.get_provenance_dict(parameters)
-> 1290 sim = _parse_sim_ancestry(
   1291     samples=samples,
   1292     sequence_length=sequence_length,
   1293     recombination_rate=recombination_rate,
   1294     gene_conversion_rate=gene_conversion_rate,
   1295     gene_conversion_tract_length=gene_conversion_tract_length,
   1296     discrete_genome=discrete_genome,
   1297     population_size=population_size,
   1298     demography=demography,
   1299     ploidy=ploidy,
   1300     model=model,
   1301     initial_state=initial_state,
   1302     start_time=start_time,
   1303     end_time=end_time,
   1304     record_migrations=record_migrations,
   1305     record_full_arg=record_full_arg,
   1306     additional_nodes=additional_nodes,
   1307     coalescing_segments_only=coalescing_segments_only,
   1308     num_labels=num_labels,
   1309     random_seed=random_seed,
   1310 )
   1311 return _wrap_replicates(
   1312     sim,
   1313     num_replicates=num_replicates,
   1314     replicate_index=replicate_index,
   1315     provenance_dict=provenance_dict,
   1316 )

File ~/work/tskit-site/tskit-site/msprime/ancestry.py:1081, in _parse_sim_ancestry(samples, sequence_length, recombination_rate, gene_conversion_rate, gene_conversion_tract_length, discrete_genome, population_size, demography, ploidy, model, initial_state, start_time, end_time, record_migrations, record_full_arg, additional_nodes, coalescing_segments_only, num_labels, random_seed, init_for_debugger)
   1078 random_seed = _parse_random_seed(random_seed)
   1079 random_generator = _msprime.RandomGenerator(random_seed)
-> 1081 return Simulator(
   1082     tables=initial_state,
   1083     recombination_map=recombination_map,
   1084     gene_conversion_map=gene_conversion_map,
   1085     gene_conversion_tract_length=gene_conversion_tract_length,
   1086     discrete_genome=discrete_genome,
   1087     ploidy=ploidy,
   1088     demography=demography,
   1089     models=models,
   1090     store_migrations=record_migrations,
   1091     additional_nodes=additional_nodes,
   1092     coalescing_segments_only=coalescing_segments_only,
   1093     start_time=start_time,
   1094     end_time=end_time,
   1095     num_labels=num_labels,
   1096     random_generator=random_generator,
   1097 )

File ~/work/tskit-site/tskit-site/msprime/ancestry.py:1426, in Simulator.__init__(self, tables, recombination_map, gene_conversion_map, gene_conversion_tract_length, discrete_genome, ploidy, demography, random_generator, models, store_migrations, additional_nodes, coalescing_segments_only, start_time, end_time, num_labels)
   1423 gene_conversion_rate = gene_conversion_map.rate[0]
   1425 start_time = -1 if start_time is None else start_time
-> 1426 super().__init__(
   1427     tables=ll_tables,
   1428     recombination_map=ll_recomb_map,
   1429     start_time=start_time,
   1430     random_generator=random_generator,
   1431     migration_matrix=demography.migration_matrix,
   1432     population_configuration=ll_population_configuration,
   1433     demographic_events=ll_demographic_events,
   1434     store_migrations=store_migrations,
   1435     additional_nodes=additional_nodes.value,
   1436     coalescing_segments_only=coalescing_segments_only,
   1437     num_labels=num_labels,
   1438     segment_block_size=segment_block_size,
   1439     avl_node_block_size=avl_node_block_size,
   1440     node_mapping_block_size=node_mapping_block_size,
   1441     gene_conversion_rate=gene_conversion_rate,
   1442     gene_conversion_tract_length=gene_conversion_tract_length,
   1443     discrete_genome=discrete_genome,
   1444     ploidy=ploidy,
   1445 )
   1446 # Highlevel attributes used externally that have no lowlevel equivalent
   1447 self.end_time = np.inf if end_time is None else end_time

InputError: Input error in initialise: Attempt to sample lineage in a population with size=0

To recover the original model we must update the Population objects in place and add the population split event:

demography["A"].initial_size = 100
demography["B"].initial_size = 200
demography["C"].initial_size = 300
demography.add_population_split(time=100, ancestral="C", derived=["A", "B"])
demography
Populations (3)
idnamedescriptioninitial_sizegrowth_ratedefault_sampling_timeextra_metadata
0AContemporary population A100.000{}
1BContemporary population B200.000{}
2CAncestral population300.001e+02{}
Migration matrix (all zero)
Events (1)
timetypeparameterseffect
100Population Splitderived=[A, B], ancestral=CMoves all lineages from derived populations 'A' and 'B' to the ancestral 'C' population. Also set the derived populations to inactive, and all migration rates to and from the derived populations to zero.

Note

See the Demography documentation for details on how to access Population objects.

The demography that we obtain from Demography.from_tree_sequence() is really just a template, which we update and elaborate as we need. In particular, we are free to add more populations as needed.

Continuing forwards-time simulations (“recapitating”)#

The most common use for the initial_state argument is to start the simulation from the output of a forwards-time simulation. Informally, we take an ‘unfinished’ tree sequence as a parameter to sim_ancestry(), initialise the simulation from the state of this tree sequence and then run the simulation until coalescence. The returned tree sequence is then the result of taking the input tree sequence and completing the trees using the coalescent.

This is useful for forwards-time simulators such as SLiM and fwdpy11 that can output tree sequences. By running forward-time simulation for a certain number of generations we obtain a tree sequence, but these trees may not have had sufficient time to reach a most recent common ancestor. By using the initial_state argument to sim_ancestry() we can combine the best of both forwards- and backwards-time simulators. The recent past can be simulated forwards in time and the ancient past by the coalescent. The coalescent simulation is initialised by the root segments of the input tree sequence, ensuring that the minimal amount of ancestral material possible is simulated.

Any tree sequence can be provided as input to this process, but there is a specific topological requirement that must be met for the simulations to be statistically correct. To ensure that ancestral segments are correctly associated within chromosomes when constructing the initial conditions for the coalescent simulation, forward-time simulators must retain the nodes corresponding to the initial generation. Furthermore, for every sample in the final generation (i.e. the extant population at the present time) there must be a path to one of the founder population nodes.

See also

See the recapitation tutorial from the pyslim documentation for a detailed explanation of this process when using SLiM.

Models#

The ancestry model determines the model under which the ancestral history of the sample is generated. It is concerned with the basic processes of how (for example) common ancestor and recombination events occur. For example, the default “standard” or Hudson coalescent is an efficient continuous time model, and the Discrete Time Wright-Fisher model allows for more detailed (if less efficient) simulations by making fewer mathematical approximations. We can combine multiple ancestry models in a simulation using a flexible notation.

See also

See the Quick reference for a summary of the available ancestry models

Important

The ancestry model is distinct from the demographic model, which is mostly independent of the chosen ancestry model.

Specifying ancestry models#

The ancestry models used during a simulation are specified using the model parameter to sim_ancestry(). Each model is a subclass of the AncestryModel class (see the following subsections for the available models, and examples of their usage).

Single models#

Most of the time we want to run a simulation under a single ancestry model. By default, we run simulations under the StandardCoalescent model. If we wish to run under a different model, we use the model argument to sim_ancestry(). For example, here we use the SMC model instead of the standard coalescent:

ts1 = msprime.sim_ancestry(
    10,
    sequence_length=10,
    recombination_rate=0.1,
    model=msprime.SmcApproxCoalescent(),
    random_seed=1234)

We specify the model as an instance of one of the ancestry model classes. For nonparametric models, there is also a useful shorthand of the model name:

ts2 = msprime.sim_ancestry(
    10,
    sequence_length=10,
    recombination_rate=0.1,
    model="smc",
    random_seed=1234)
assert ts1.equals(ts2, ignore_provenance=True)

This string form is useful for single model simulations; however, when working with Multiple models it is better to use the explicit class form so that the Model duration can be set.

Model duration#

Each ancestry model instance has a duration associated with it. This is the maximum amount of time that this model can run for. Thus, if we wanted to run 10 generations of the DiscreteTimeWrightFisher model we could write:

ts = msprime.sim_ancestry(
    3,
    population_size=10,
    model=msprime.DiscreteTimeWrightFisher(duration=10),
    random_seed=1234)
SVG(ts.draw_svg(y_axis=True))
_images/271f15721480677eafe98ea4c70e9792e42882bbeeca02207344f9a1be7e4f0f.svg

Note

Using the duration value for a single model like this is identical to specifying an end_time value: see the Stopping simulations early section for more information.

It is vital to understand that the duration value here is not an absolute time at which the simulation must stop, but rather the number of generations that the model should run for. To illustrate this we can use the start_time parameter:

ts = msprime.sim_ancestry(
    3,
    population_size=10,
    start_time=5,
    model=msprime.DiscreteTimeWrightFisher(duration=10),
    random_seed=1234)
SVG(ts.draw_svg(y_axis=True))
_images/5708332a0b6dfd2080cc9ead69b2ad7bcf0abec7dc53b03f58f46c9aaf3d2248.svg

We can see that the simulation continued until 15 generations ago, because it didn’t start until 5 generations ago and we then specified that the model should run for at most 10 generations.

The model activity time is specified as a duration in this way as this is a natural way to specify Multiple models, in which some of the models may run for random periods of time.

Model completion#

Most ancestry models will run until the overall simulation has completed; that is, when we have simulated back to the most recent common ancestor at every position along the sequence. For example, the StandardCoalescent and DiscreteTimeWrightFisher models will simulate until coalescence, unless we explicitly indicate that we wish to stop the simulation early via the model duration or the end_time parameter.

Models such as SweepGenicSelection are different though: they are only defined for a limited amount of time and may not run until coalescence (see the Selective sweeps for more information on the model itself).

sweep = msprime.SweepGenicSelection(
    position=5,
    start_frequency=0.1,
    end_frequency=0.9,
    s=0.25,
    dt=1e-6,
)
ts = msprime.sim_ancestry(
    3,
    sequence_length=10,
    population_size=100,
    model=sweep,
    random_seed=1234)
SVG(ts.draw_svg(y_axis=True))
_images/eefbe1cdfe74833d8a81d33add0367e68b89cff56e17ac7c6e3c5f9aec5048cf.svg

Here, we see that even though we didn’t specify a model duration or end_time parameter the simulation has ended before we reached coalescence. In addition, the model duration is random:

ts = msprime.sim_ancestry(
    3,
    sequence_length=10,
    population_size=100,
    model=sweep,
    random_seed=12345)
SVG(ts.draw_svg(y_axis=True))
_images/3dab8c2a0b030a4e487e73dfaf84fd6c34f5a049ae119f8aef5d4bfaec7430c4.svg

Here we ran the same simulation with a different random seed, and we can see that the time the model completed was different in the two cases (30.22 vs 28.15 generations ago).

Note

Models such as SweepGenicSelection can also complete due to full coalescence.

Multiple models#

Sometimes we are interested in combining different models in a simulation. To do this we provide a list of ancestry model instances as the model argument to sim_ancestry(). The models are run in the order they are specified. For each model, if its duration is set, then the simulation will run for at most that additional time duration. If the duration is not set (or None), then the model will run until completion.

For example, here we run a hybrid DTWF and coalescent simulation (see the Discrete Time Wright-Fisher section for more details):

ts = msprime.sim_ancestry(
    2,
    population_size=1000,
    model=[
        msprime.DiscreteTimeWrightFisher(duration=500),
        msprime.StandardCoalescent(),
    ],
    random_seed=2
)

Warning

It is very imporant to remember to set a duration value in the first model here! Otherwise, the entire simulation will run under the DTWF model.

In this example we’ve run the DiscreteTimeWrightFisher model for the first 500 generations, and then switched to the StandardCoalescent. Because we did not specify a duration in the second model, it will run until coalescence.

Note

Switching models at a fixed time point like this is equivalent to first running the DTWF phase of the simulation with end_time=500 and then using the output as the initial state for the simulation of the Hudson phase. The model switching syntax here is a little more convenient and efficient, however.

See also

See the Multiple sweeps section for an example of running many models with random durations.

Tip

The logging output of msprime can be very useful when working with multiple models. See the Logging section for more details.

Hudson coalescent#

The standard coalescent is the default model in msprime. The algorithm for simulating the coalescent with recombination was developed by Hudson (1983), and so we refer to it as the “Hudson” coalescent.

Running a simulation without specifying a model is the same as running with model="hudson":

ts1 = msprime.sim_ancestry(5, random_seed=2)
ts2 = msprime.sim_ancestry(5, model="hudson", random_seed=2)
# This is the same simulation so tree sequences are equal
assert ts1.equals(ts2, ignore_provenance=True)

Time is measured in units of generations ago. This is a continuous time model, in which the simulation moves event-by-event backwards in time (contrasting with the Discrete Time Wright-Fisher model, which works generation-by-generation). Thus, we can have fractional generations, as in this example where the MRCA of the sample occurs 10.59 “generations” ago:

ts = msprime.sim_ancestry(3, random_seed=234)
SVG(ts.draw_svg(y_axis=True))
_images/0ffdc8a306ca020f4de59b96dba11378e0b5983b036e1e1166cabe5636147ca7.svg

This occurs because time is scaled to be proportional to the population size; if we run the same simulation with a different population size, we can see that the time values are simply scaled up accordingly (the default population_size is 1):

ts = msprime.sim_ancestry(3, population_size=100, random_seed=234)
SVG(ts.draw_svg(y_axis=True))
_images/525d3a255df07178fb494f851fe89da738423a7db212eecea0e8688f062479bf.svg

See also

The ploidy parameter also controls the time scale; the default here is the diploid time scale where time is measured in units of 4N generations.

SMC approximations#

The SMC and SMC' are approximations of the continuous time Hudson coalescent model. These were originally motivated largely by the need to simulate coalescent processes more efficiently than was possible using the software available at the time; however, improved algorithms mean that such approximations are now mostly unnecessary for simulations.

The SMC and SMC’ are however very important for inference, as the approximations have made many analytical advances possible.

Since the SMC approximations are not required for simulation efficiency, these models are implemented using a naive rejection sampling approach in msprime. The implementation is intended to facilitate the study of the SMC approximations, rather than to be used in a general-purpose way.

Discrete Time Wright-Fisher#

Msprime provides the option to perform discrete-time Wright-Fisher (DTWF) simulations for scenarios when the coalescent model is not appropriate, including large sample sizes, multiple chromosomes, or recent migration. Please see the API documentation for a formal description of the model, and Nelson et al. 2020 for more details and background information.

To run DTWF simulations, we need to provide the model="dtwf" argument to sim_ancestry():

ts = msprime.sim_ancestry(4, population_size=10, model="dtwf", random_seed=7)
SVG(ts.draw_svg(y_axis=True, time_scale="log_time"))
_images/8564e6bd95f1de9dcd330624ee44f5034d25f0b28d1ed55688926f4fe188b7e9.svg

There are a few important points to note here:

  1. All node times are integers (this is discrete time counted in generations);

  2. We can have non-binary nodes (see node 10);

  3. We can have simultaneous events (see nodes 8 and 9).

Most features can be used with the DTWF model, and an error will be raised if you attempt to use a feature that is not implemented for DTWF. (Please let us know if this feature is important to you!)

Important

For DTWF simulations with population structure, each row of the migration matrix must sum to one or less.

Because DTWF simulations trace the history of a sample by going backwards in time generation-by-generation rather than event-by-event like the continuous time models, it is significantly slower to simulate than the standard coalescent.

“Hybrid” simulations can be a good solution if DTWF simulations are slow. In this case we can use DTWF simulations for the recent past, and using the standard coalescent for the more distant past (where it is an excellent approximation of the discrete time model). Hybrid simulations of this type provide the best of computational efficiency and accuracy.

See also

See Nelson et al. 2020 for more details and examples where the DTWF model is more realistic than the coalescent, and an assessment of the accuracy of hybrid simulations.

For example, here we simulate with the DTWF model for 500 generations, before switching to the standard (Hudson) coalescent:

ts = msprime.sim_ancestry(
    2,
    population_size=1000,
    model=[
        msprime.DiscreteTimeWrightFisher(duration=500),
        msprime.StandardCoalescent(),
    ],
    random_seed=4
)
SVG(ts.draw_svg(y_axis=True, time_scale="log_time"))
_images/5fa3d77c916ed4e2d2ee52149e035cb6ad9b540acedbe10e2ba8a765b9a7ac63.svg

Because of the integer node times, we can see here that most of the coalescent happened during the Wright-Fisher phase of the simulation, and as-of 500 generations in the past, there were only two lineages left. The continuous time standard coalescent model was then used to simulate the ancient past of these two lineages.

See the Specifying ancestry models section for more details on how the use the model argument to sim_ancestry() to run multiple models.

See also

See the Multiple chromosomes section for information on how to simulate multiple chromosomes using the DTWF.

Multiple merger coalescents#

Some evolutionary scenarios, such as a skewed offspring distribution combined with a type III survivorship curve, range expansion, and rapid adaptation, can predict genealogies with up to four simultaneous multiple mergers. Msprime provides the option to simulate from two classes of such genealogical processes: the BetaCoalescent and the DiracCoalescent. Please see the API documentation for formal details of the models.

For haploid organisms, both models result in genealogies in which any number of lineages can merge into a common ancestor, but only one merger event can take place at a given time. For \(p\)-ploids, up to \(2p\) simultaneous mergers can take place, corresponding to the \(2p\) available parental chromosome copies.

The diploid Beta-Xi-coalescent can be simulated as follows:

ts = msprime.sim_ancestry(
    samples=5, ploidy=2, random_seed=1,
    model=msprime.BetaCoalescent(alpha=1.001))
SVG(ts.draw_svg())
_images/3296b5c225cf50eeacd08937ee84b6dcbe2b17bb41905b34f5488cae216122d6.svg

The specified value of \(\alpha = 1.001\) corresponds to a heavily skewed offspring distribution. Values closer to \(\alpha = 2\) result in trees whose distribution is closer to that of the standard coalescent, often featuring no multiple mergers for small sample sizes:

ts = msprime.sim_ancestry(
    samples=4, ploidy=2, random_seed=1,
    model=msprime.BetaCoalescent(alpha=1.8))
SVG(ts.draw_svg())
_images/190ee68f4cd1f07df3f5d7cee840e2495411a3563d91099618c903b406ed6129.svg

Multiple mergers still take place in a haploid simulation, but only one merger can take place at a given time:

ts = msprime.sim_ancestry(
    samples=10, ploidy=1, random_seed=1,
    model=msprime.BetaCoalescent(alpha=1.001))
SVG(ts.draw_svg())
_images/a4bcdb2ea8089ba0cbe23f10e0f4cf3bba110206cd7cb2f75607f74860b2cb0b.svg

A haploid simulation results in larger individual mergers than a polyploid simulation because large mergers typically get broken up into multiple simultaneous mergers in the polyploid model.

The number of generations between merger events in the Beta-coalescent depends nonlinearly on both \(\alpha\) and the population size \(N\) as detailed above. For a fixed \(\alpha\), the number of generations between common ancestor events is proportional to \(N^{\alpha - 1}\), albeit with a complicated constant of proportionality that depends on \(\alpha\). The dependence on \(\alpha\) for fixed \(N\) is not monotone. Thus, branch lengths and the number of generations until a most recent common ancestor depend on both of these parameters.

To illustrate, for \(\alpha\) close to 2 the relationship between effective population size and number of generations is almost linear:

ts = msprime.sim_ancestry(
    samples=1, ploidy=2, random_seed=1, population_size=10,
    model=msprime.BetaCoalescent(alpha=1.99))
tree = ts.first()
print(tree.tmrca(0,1))
ts = msprime.sim_ancestry(
    samples=1, ploidy=2, random_seed=1, population_size=1000,
    model=msprime.BetaCoalescent(alpha=1.99))
tree = ts.first()
print(tree.tmrca(0,1))
0.14959691919068155
14.286394871874865

For \(\alpha\) close to 1 the effective population size has little effect:

ts = msprime.sim_ancestry(
    samples=1, ploidy=2, random_seed=1, population_size=10,
    model=msprime.BetaCoalescent(alpha=1.1))
tree = ts.first()
print(tree.tmrca(0,1))
ts = msprime.sim_ancestry(
    samples=1, ploidy=2, random_seed=1, population_size=1000,
    model=msprime.BetaCoalescent(alpha=1.1))
tree = ts.first()
print(tree.tmrca(0,1))
16.311807036386615
25.85247192870844

The Dirac-coalescent is simulated similarly in both the diploid case:

ts = msprime.sim_ancestry(
    samples=5, ploidy=2, random_seed=1,
    model=msprime.DiracCoalescent(psi=0.9, c=10))
SVG(ts.draw_svg())
_images/2f58998201644cbcb2be79164e750d45139f39dde32e8b1cb881ed093d60cd08.svg

and in the haploid case:

ts = msprime.sim_ancestry(
    samples=10, ploidy=1, random_seed=1,
    model=msprime.DiracCoalescent(psi=0.9, c=10))
SVG(ts.draw_svg())
_images/453f1926088591bfbe29bb75cf92cc6b8dfc754b9172c90c8d48d974f4b73f8b.svg

As with the Beta-coalescent, a haploid simulation results in larger individual mergers than a polyploid simulation because large mergers typically get broken up into multiple simultaneous mergers in the polyploid model. Larger values of the parameter \(c > 0\) result in more frequent multiple mergers, while larger values of \(0 < \psi \leq 1\) result in multiple mergers with more participating lineages. Setting either parameter to 0 would correspond to the standard coalescent.

The Dirac-coalescent is obtained as the infinite population scaling limit of Moran models, and therefore branch lengths are proportional to \(N^2\) generations, as opposed to \(N\) generations under the standard coalescent.

ts = msprime.sim_ancestry(
    samples=1, ploidy=2, random_seed=1, population_size=10,
    model=msprime.DiracCoalescent(psi=0.1, c=1))
tree = ts.first()
print(tree.tmrca(0,1))
ts = msprime.sim_ancestry(
    samples=1, ploidy=2, random_seed=1, population_size=100,
    model=msprime.DiracCoalescent(psi=0.1, c=1))
tree = ts.first()
print(tree.tmrca(0,1))
64.7216732891741
6472.167328917411

Because branch lengths are proportional to \(N^2\) generations, the number of events simulated along branches (such as mutations or recombinations) will also be proportional to \(N^2\). For example, if a tree sequence generated using a given census size \(N\) is used as input for simulating segregating sites (see Mutation simulations for details on how to do this), the expected number of segregating sites obtained will be proportional to \(N^2\).

ts = msprime.sim_ancestry(
    samples=1, ploidy=2, random_seed=1, population_size=10000,recombination_rate=1e-8,
    sequence_length=10**4,model=msprime.DiracCoalescent(psi=0.1, c=0))
mts = msprime.sim_mutations(ts, rate=1e-8)
print(str(mts.get_num_sites()))
6652

If the desired amount of diversity (as measured by the number of segregating sites) is proportional to \(N\), then \(N^{1/2}\) should be used for the population size under the Dirac-coalescent.

ts = msprime.sim_ancestry(
    samples=1, ploidy=2, random_seed=1, population_size=100,recombination_rate=1e-8,
    sequence_length=10**4,model=msprime.DiracCoalescent(psi=0.1, c=0))
mts = msprime.sim_mutations(ts, rate=1e-8)
print(str(mts.get_num_sites()))
1

Therefore the population size must be carefully chosen (and potentially rescaled) to obtain the desired simulated data in these models.

Selective sweeps#

Although the coalescent assumes that lineages are exchangable, (and therefore evolving neutrally), approximations to some forms of selective sweep (beneficial mutation moves through the population), have been derived. This is done through the use of a structured coalescent model in the spirit of Braverman et al. (1995).

In the SweepGenicSelection model, the population is split between two classes of selective backgrounds, those linked to the beneficial allele, call it \(B\), and those not, \(b\). In this model we track competing rates of coalescence and recombination on the \(B\) and \(b\) backgrounds. The user supplies a final allele frequency and a starting allele frequency, between which msprime simulates a stochastic sweep trajectory according to a conditional diffusion model (Coop and Griffiths, 2004; Kim and Stephan 2002).

Beyond the start and end frequencies of the sweep trajectory, the user must specify the selection coefficient of the beneficial mutation \(s\) the selective advantage of the \(B\) homozygote over the \(b\) homozygote and \(h=0.5\). The position represents the location along the chromosome where the beneficial allele occurs. All other parameters can be set as usual.

Warning

The implementation of the structured coalescent is quite general, but there are some limitations in the current implementation of the sweeps model (e.g., no change of population size during a sweep). Please let us know if there are related features you would like to see (or if you would be interested in helping to create this functionality!)

Hard sweeps#

Todo

What’s a hard sweep? Some intro sentences here.

In this example we run some replicate hard sweep simulations and plot the mean pairwise diversity in windows across the simulated region, showing the characteristic valley of diversity.

First we set up some basic parameters, and define the sweep model:

Ne = 1e3
L = 1e6  # Length of simulated region
num_reps = 100

# define hard sweep model
sweep_model = msprime.SweepGenicSelection(
    position=L / 2,  # middle of chrom
    start_frequency=1.0 / (2 * Ne),
    end_frequency=1.0 - (1.0 / (2 * Ne)),
    s=0.25,
    dt=1e-6,
)

Todo

Explain why these parameters give a hard sweep.

Next we set up the replicate simulations. As described in the Multiple models section, the model parameter is a list when we want to run multiple ancestry models. In this example, we have a sweep that occurs in the immediate past, and is then followed by the standard coalescent for the rest of time:

reps = msprime.sim_ancestry(
    5,
    model=[sweep_model, msprime.StandardCoalescent()],
    population_size=Ne,
    recombination_rate=1e-7,
    sequence_length=L,
    num_replicates=num_reps,
)

Note

Because the SweepGenicSelection model has a random duration we do not set this value in the model. Please see the Model completion section for more discussion on this important point.

Once we’ve set up the replicate simulations we can compute the windows for plotting, run the actual simulations (see the Running replicate simulations section for more details) and compute the pairwise diversity.

wins = np.linspace(0, L, 21)
mids = (wins[1:] + wins[:-1]) / 2
diversity = np.zeros((num_reps, mids.shape[0]))
for j, ts in enumerate(reps):
    diversity[j] = ts.diversity(windows=wins, mode="branch")

Finally, we can plot the observed mean diversity across the replicates and compare it to the neutral expectation:

from matplotlib import pyplot as plt

plt.plot(mids, diversity.mean(axis=0), label="Simulations")
plt.axhline(4 * Ne, linestyle=":", label=r'Neutral expectation')
plt.ylabel(r'Branch $\pi$');
plt.xlabel('Position (bp)')
plt.legend();
_images/d30c366257379841d282ef9a968a621a1443b74a49bc179ec5bbdc1cc692aa04.svg

Note

We use the “branch” measure of pairwise diversity which is defined in terms of the trees rather than nucleotide sequences. See Ralph et al. 2020 for more information.

As we can see, the selective sweep has reduced variation in the region most closely linked to the beneficial allele and then diversity increases with distance to each side.

Multiple sweeps#

Todo

This example is very artificial; it would be much better to illustrate with a meaningful example from the literature. Pull requests welcome!

To illustrate the idea that we can simulate large numbers of different models, in this example we create a simulation in which we have sweeps happening at random points along the genome, which are separated by a random duration of the standard coalescent.

First we set up our models:

import random
random.seed(1234)
L = 100
Ne = 1e6

models = []
for _ in range(100):
    models.append(msprime.StandardCoalescent(duration=random.uniform(0, 1)))
    models.append(
        msprime.SweepGenicSelection(
            position=random.randint(1, L - 1),
            start_frequency=1 / (2 * Ne),
            end_frequency=1.0 - 1 / (2 * Ne),
            s=0.01,
            dt=1e-6,
        )
    )
models.append(msprime.StandardCoalescent())
models[:5]
[StandardCoalescent(duration=0.9664535356921388),
 SweepGenicSelection(duration=None, position=57, start_frequency=5e-07, end_frequency=0.9999995, s=0.01, dt=1e-06),
 StandardCoalescent(duration=0.11685051774599753),
 SweepGenicSelection(duration=None, position=12, start_frequency=5e-07, end_frequency=0.9999995, s=0.01, dt=1e-06),
 StandardCoalescent(duration=0.9109759624491242)]

We’ve created a list of models that have random parameters using the standard Python tools. There are lots of sweeps (many more than we will actually run, in all probability), each at a random position on the genome and each separated by a random amount of time running the standard coalescent. Finally, we add an instance of the standard coalescent without a duration to guarantee that our simulation will always coalesce fully (See the Model completion section for more information.) Finally, we print out the first 5 models so we can inspect them.

Then, we can run our simulation as usual:

ts = msprime.sim_ancestry(
    10,
    population_size=100,
    sequence_length=L,
    recombination_rate=0.01,
    model=models,
    random_seed=6789
)
ts
Tree Sequence
Trees100
Sequence Length100.0
Time Unitsgenerations
Sample Nodes20
Total Size102.7 KiB
MetadataNo Metadata
Table Rows Size Has Metadata
Edges 1588 49.6 KiB
Individuals 10 304 Bytes
Migrations 0 8 Bytes
Mutations 0 16 Bytes
Nodes 523 14.3 KiB
Populations 1 224 Bytes
Provenances 1 25.8 KiB
Sites 0 16 Bytes

Tip

The logging output of msprime can be very useful when working with multiple models. See the Logging section for more details.

Fixed pedigree#

A pedigree describes parent-offspring relationships between individuals, and these relationships constrain how genomes are inherited. We can simulate ancestry conditioned on a fixed pedigree using the FixedPedigree ancestry model. In this model the transfer of genetic material from parent to offspring is determined by the pedigree. Recombination occurs randomly such that each of a diploid individual’s two genomes is a mixture of two corresponding parental genomes, but everything else is determined by the pedigree.

Warning

It is important to note that there is no concept of a population genetic model “outside” of the pedigree in this model, and there are significant caveats when dealing with incomplete pedigree information and the treatment of founders. Please see the Completing fixed pedigree simulations section for more information.

Simple example#

First we define a small pedigree, which follows a sample of three individuals through four generations:

See also

See the Pedigrees section for details on this pedigree text format and information on how pedigrees are encoded in msprime via the individual and node tables.

import io
ped_txt = """\
# id parent0 parent1 time is_sample
0   5   4   0.0 1
1   3   3   0.0 1
2   3   4   0.0 1
3   7   7   1.0 0
4   8   8   1.0 0
5   8   6   1.0 0
6   9   10  2.0 0
7   10  10  2.0 0
8   9   9   2.0 0
9   11  12  3.0 0
10  11  12  3.0 0
11  .   .   4.0 0
12  .   .   4.0 0
"""
pedigree = msprime.parse_pedigree(io.StringIO(ped_txt), sequence_length=100)

draw_pedigree(pedigree.tree_sequence())
_images/95e203480de2293e2269e176bddfd9fb9409bb0e6ab2da2f88eeb4f64e7ce82e.svg

Note

See the Visualising pedigrees section for a definition of the draw_pedigree function used here.

We then use this pedigree information as the initial state for the simulation (note that we set the sequence length on the pedigree tables when we called parse_pedigree()):

ped_ts = msprime.sim_ancestry(
    initial_state=pedigree, model="fixed_pedigree", random_seed=41,
    recombination_rate=0.001)
node_labels = {node.id: f"{node.individual}({node.id})" for node in ped_ts.nodes()}
SVG(ped_ts.draw_svg(y_axis=True,  node_labels=node_labels, size=(600,200)))
_images/333e8540bdf715f615459642765add579328ab962edcc053d5fc563c601705bf.svg

The output tree sequence contains the simulated genetic ancestry conditioned on this input pedigree, as shown in the trees above. Each node is labelled with it the individual ID it’s associated with, and its ID (as individual(node)). The simulation stops at time 4, and we are left with multiple roots, which correspond to the founder individuals. If you only need simulations of the genetic ancestry through the pedigree, then these trees can be used perfectly well in downstream applications (such as Mutation simulations). See the Completing fixed pedigree simulations section for information (and caveats about) completing these simulations.

Note

Because the input pedigree fully describes the simulation many features such as demography are not available when we use the FixedPedigree model. The FixedPedigree model also cannot be combined with other ancestry models.

Censoring pedigree information#

Warning

The output tree sequence also contains the full pedigree information provided as input. This may be important if you are performing simulations on pedigrees that are not freely available.

It is straightforward to remove the pedigree information from an output tree sequence, if required:

tables = ped_ts.dump_tables()
tables.individuals.clear()
tables.nodes.individual = np.full_like(tables.nodes.individual, tskit.NULL)
censored_ts = tables.tree_sequence()
censored_ts
Tree Sequence
Trees2
Sequence Length100.0
Time Unitsgenerations
Sample Nodes6
Total Size2.6 KiB
MetadataNo Metadata
Table Rows Size Has Metadata
Edges 13 424 Bytes
Individuals 0 40 Bytes
Migrations 0 8 Bytes
Mutations 0 16 Bytes
Nodes 26 736 Bytes
Populations 1 224 Bytes
Provenances 1 1.0 KiB
Sites 0 16 Bytes

Completing fixed pedigree simulations#

If you wish to simulate ancestry older than what is defined in the pedigree, you can “complete” (or “recapitate”) the within-pedigree simulation using the standard methods described in the Specifying the initial state section.

Continuing the example in the previous section, we provide the simulated ancestry in ped_ts as the initial state for a DiscreteTimeWrightFisher simulation with a population size of 10:

final_ts = msprime.sim_ancestry(
    initial_state=ped_ts, model="dtwf", population_size=10, random_seed=42)
SVG(final_ts.draw_svg(y_axis=True, size=(600,200)))
_images/9beb59d45a2f5c6ce88f0bb2486f15eaeb270809cd35627d88f8894778ffc533.svg

We can see that the nodes 22 and 23 representing the founder genomes from the input pedigree are used as the starting point for the DTWF simulation, which then finishes at generation 6 when the MRCA (node 26) is found.

Missing pedigree data#

The previous example is very straightforward because the pedigree is complete and we can trace all individuals back to the same generation in the pedigree. The pedigrees we have in practice are usually not like this, and there is a great deal of missing data. The consequence of this for the FixedPedigree simulations is that genetic material will often be at a “dead end” at many levels of the pedigree during a simulation.

ped_txt = """\
# id parent0 parent1 time is_sample
0   5   4   0.0 1
1   3   3   0.0 1
2   3   4   0.0 1
3   7   7   1.0 0
4   8   8   1.0 0
5   8   6   1.0 0
6   9   10  2.0 0
7   .   .   2.0 0
8   9   9   2.0 0
9   11  12  3.0 0
10  11  12  3.0 0
11  .   .   4.0 0
12  .   .   4.0 0
"""
pedigree = msprime.parse_pedigree(io.StringIO(ped_txt), sequence_length=100)

draw_pedigree(pedigree.tree_sequence())
_images/7dfc445c6b266c7f6b5bea3053020aa0bbcc67c47566f46277a689492d4aa825.svg

This pedigree has a “dead end” for individual 7, which is reflected in the trees that we output from the FixedPedigree simulation:

ped_ts = msprime.sim_ancestry(
    initial_state=pedigree, model="fixed_pedigree", random_seed=41,
    recombination_rate=0.001)
node_labels = {node.id: f"{node.individual}({node.id})" for node in ped_ts.nodes()}
SVG(ped_ts.draw_svg(y_axis=True,  node_labels=node_labels, size=(600,200)))
_images/7bdd5525233b4c744556f2db31d59cacec2d365bb735305044a20b58100a2277.svg

When we provide this tree sequence as the initial state of another simulation, we then effectively start the simulation at the earlier time point corresponding to individual 7. Ancestral material in older parts of the pedigree is then brought into the simulation in the same manner as ancient samples as we proceed backwards in time.

Warning

It is important to note that although ancestral material for pedigree founders is introduced into the recapitation simulation at the correct time, no account is taken of the number of lineages present in the pedigree when calculating population sizes. Thus, the pedigree must be seen as entirely external to the population model simulated during recapitation.

Pedigrees and demography#

For complex simulations in which our pedigree individuals (and founders, in particular) are drawn from different populations, we will want our follow-up recapitation simulations to use a demographic model. In order to do this, we must define the demographic model before creating the input pedigree. For example, consider the following simple demography:

demography = msprime.Demography()
demography.add_population(name="A", initial_size=10)
demography.add_population(name="B", initial_size=20)
demography.add_population(name="C", initial_size=100)
demography.add_population_split(time=10, derived=["A", "B"], ancestral="C");

We define a pedigree in which the individuals belong to the two “leaf” populations, A and B:

ped_txt = """\
# id    parent0 parent1 time    is_sample   population
0   9   11  0.0 1   A
1   10  10  0.0 1   A
2   6   7   0.0 1   A
3   7   7   0.0 1   A
4   8   9   0.0 1   A
5   10  10  0.0 1   A
6   14  14  1.0 0   A
7   12  12  1.0 0   A
8   16  17  1.0 0   A
9   12  13  1.0 0   A
10  14  12  1.0 0   A
11  15  16  1.0 0   A
12  .   .   2.0 0   A
13  .   .   2.0 0   A
14  .   .   2.0 0   A
15  .   .   2.0 0   A
16  .   .   2.0 0   A
17  .   .   2.0 0   A
18  23  22  0.0 1   B
19  22  22  0.0 1   B
20  21  22  0.0 1   B
21  25  25  1.0 0   B
22  24  25  1.0 0   B
23  24  24  1.0 0   B
24  27  26  2.0 0   B
25  27  26  2.0 0   B
26  30  28  3.0 0   B
27  30  29  3.0 0   B
28  31  32  4.0 0   B
29  31  32  4.0 0   B
30  32  31  4.0 0   B
31  .   .   5.0 0   B
32  .   .   5.0 0   B
"""
pedigree = msprime.parse_pedigree(
    io.StringIO(ped_txt), demography=demography, sequence_length=100)

draw_pedigree(pedigree.tree_sequence())
_images/c81d0dd1106769f2408e184b1742c82c529bb2175335e40ae6fa7e8f5a0fc349.svg

We can see that the pedigree is split into two sub-pedigrees for the two populations, and the founders for the two populations exist at different times (2 and 5 generations ago for A and B, respectively). Running the FixedPedigree simulation, we can see that the trees reflect this structure:

ped_ts = msprime.sim_ancestry(
    initial_state=pedigree, model="fixed_pedigree", random_seed=41)
node_labels = {node.id: f"{node.individual}({node.id})" for node in ped_ts.nodes()}
SVG(ped_ts.draw_svg(y_axis=True,  node_labels=node_labels, size=(600,200)))
_images/1278265a496ec4a4a5f75488ecd354f2e29b716495f47ed99a1597312e10c988.svg

When we complete this simulation we must use a Demography object that matches the populations defined in the original pedigree tables. The simplest way to do this is define the demography first, and use the same object throughout as we have done here:

full_ts = msprime.sim_ancestry(
    initial_state=ped_ts, demography=demography, random_seed=42, model="dtwf")
node_labels = {
    node.id: f"{node.id}:{full_ts.population(node.population).metadata['name']}"
    for node in full_ts.nodes()}
SVG(full_ts.draw_svg(
    y_axis=True,  node_labels=node_labels, size=(600,300), time_scale="rank"))
_images/f6e12f1096add1cdf197eb2c7957fc20b625f00a5a3ef3e4fccadbf4e9bb1776.svg

Here we have changed the node labelling in the trees to show the name of the population of each node. We can see that all of the early nodes are in the populations defined from the pedigree. When the recapitation simulation begins, all nodes are assigned into the relevant populations and the simulation proceeds from there under the defined demography. In this demographic model, populations A and B split from C 10 generations ago, and so the remaining lineages in populations A and B evolve independently until this time. We can see that older coalescences all happen within population C, as stipulated by the demographic model.

Warning

The same caveats that we noted above about missing data in the pedigree apply: although the remaining ancestral lineages will join the simulation at the correct times and in the specified population, no account is taken of the number of extant lineages from the pedigree when computing population sizes.

Missing data#

Todo

Give an example of doing a simulation with unknown rates on the flanks and show that the resulting tree sequence has missing data in these regions.

Common errors#

Infinite waiting time#

Todo

explain this, why it happens and give examples of when we don’t detect it. Mention the possible_lineage_locations method.

demography = msprime.Demography()
demography.add_population(name="A", initial_size=1)
demography.add_population(name="B", initial_size=1)
msprime.sim_ancestry(samples={"A": 1, "B": 1}, demography=demography)
---------------------------------------------------------------------------
LibraryError                              Traceback (most recent call last)
Cell In[105], line 4
      2 demography.add_population(name="A", initial_size=1)
      3 demography.add_population(name="B", initial_size=1)
----> 4 msprime.sim_ancestry(samples={"A": 1, "B": 1}, demography=demography)

File ~/work/tskit-site/tskit-site/msprime/ancestry.py:1311, in sim_ancestry(samples, demography, sequence_length, discrete_genome, recombination_rate, gene_conversion_rate, gene_conversion_tract_length, population_size, ploidy, model, initial_state, start_time, end_time, record_migrations, record_full_arg, additional_nodes, coalescing_segments_only, num_labels, random_seed, num_replicates, replicate_index, record_provenance)
   1289     provenance_dict = provenance.get_provenance_dict(parameters)
   1290 sim = _parse_sim_ancestry(
   1291     samples=samples,
   1292     sequence_length=sequence_length,
   (...)
   1309     random_seed=random_seed,
   1310 )
-> 1311 return _wrap_replicates(
   1312     sim,
   1313     num_replicates=num_replicates,
   1314     replicate_index=replicate_index,
   1315     provenance_dict=provenance_dict,
   1316 )

File ~/work/tskit-site/tskit-site/msprime/ancestry.py:681, in _wrap_replicates(simulator, num_replicates, replicate_index, provenance_dict, mutation_rate)
    675 iterator = simulator.run_replicates(
    676     num_replicates,
    677     mutation_rate=mutation_rate,
    678     provenance_dict=provenance_dict,
    679 )
    680 if replicate_index is not None:
--> 681     deque = collections.deque(iterator, maxlen=1)
    682     return deque.pop()
    683 else:

File ~/work/tskit-site/tskit-site/msprime/ancestry.py:1606, in Simulator.run_replicates(self, num_replicates, mutation_rate, provenance_dict)
   1604 for replicate_index in range(num_replicates):
   1605     logger.info("Starting replicate %d", replicate_index)
-> 1606     self.run()
   1607     if mutation_rate is not None:
   1608         # This is only called from simulate() or the ms interface,
   1609         # so does not need any further parameters.
   1610         mutations._simple_mutate(
   1611             tables=self.tables,
   1612             random_generator=self.random_generator,
   1613             sequence_length=self.sequence_length,
   1614             rate=mutation_rate,
   1615         )

File ~/work/tskit-site/tskit-site/msprime/ancestry.py:1571, in Simulator.run(self, event_chunk, debug_func)
   1569     raise ValueError("Model durations must be >= 0")
   1570 end_time = min(self.time + model_duration, self.end_time)
-> 1571 exit_reason = self._run_until(end_time, event_chunk, debug_func)
   1572 if exit_reason == ExitReason.COALESCENCE or self.time == self.end_time:
   1573     logger.debug("Skipping remaining %d models", len(self.models) - j - 1)

File ~/work/tskit-site/tskit-site/msprime/ancestry.py:1536, in Simulator._run_until(self, end_time, event_chunk, debug_func)
   1534 ret = ExitReason.MAX_EVENTS
   1535 while ret == ExitReason.MAX_EVENTS:
-> 1536     ret = ExitReason(super().run(end_time, event_chunk))
   1537     if self.time > end_time:
   1538         # Currently the Pedigree and Sweeps models are "non-reentrant"
   1539         # We can change this to an assertion once these have been fixed.
   1540         raise RuntimeError(
   1541             f"Model {self.model['name']} does not support interruption. "
   1542             "Please open an issue on GitHub"
   1543         )

LibraryError: Infinite waiting time until next simulation event.