Ancestry simulations

Msprime simulates the ancestry of sampled genomes under a number of different backwards-in-time population genetic models. The simulated ancestry is represented as a succinct tree sequence using the tskit library, which provides an extensive suite of operations for analysing genealogical trees and DNA sequence data. See the Getting started with tskit tutorial for more information on how to process data using tskit.

Here we run a simple simulation and show a summary of the resulting tree sequence:

ts = msprime.sim_ancestry(2)
ts
Tree Sequence
Trees1
Sequence Length1.0
Sample Nodes4
Total Size1.6 KiB
MetadataNo Metadata
Table Rows Size Has Metadata
Edges 6 172 Bytes
Individuals 2 44 Bytes
Migrations 0 4 Bytes
Mutations 0 8 Bytes
Nodes 7 172 Bytes
Populations 1 216 Bytes
Provenances 1 962 Bytes
Sites 0 8 Bytes

The sim_ancestry() function has many parameters to specify the details of the simulations. The API documentation contains a precise description of the parameter; the sections in this page provide some explanation and examples of how to use these parameters.

Note

It’s important to note that sim_ancestry() only simulates the ancestral trees for samples: if we want actual sequence data then we also need to simulate mutations on these trees. See the Mutation simulations section for more information on how to do this.


Quick reference

sim_ancestry()

Simulate ancestral tree sequence topology

Models

StandardCoalescent

Coalescent with recombination (“hudson”)

SmcApproxCoalescent

Sequentially Markov Coalescent (“smc”)

SmcPrimeApproxCoalescent

SMC’(“smc_prime”)

DiscreteTimeWrightFisher

Generation-by-generation Wright-Fisher

BetaCoalescent

Beta coalescent multiple-merger

DiracCoalescent

Dirac coalescent multiple-merger

SweepGenicSelection

Selective sweep at a linked locus


Specifying samples

The samples argument to sim_ancestry() defines the number of sampled individuals whose history will be simulated. There are three different forms; the samples argument can be:

  • an integer, interpreted as the number of sampled individuals to draw in a single population model;

  • a dictionary mapping population references (either integer IDs or names) to the number of sampled individuals for that population;

  • a list of SampleSet objects, which provide more flexibility in how groups of similar samples are drawn from populations.

In the simplest case, provide a single integer which defines the number of samples:

ts = msprime.sim_ancestry(2, random_seed=1234)
ts
Tree Sequence
Trees1
Sequence Length1.0
Sample Nodes4
Total Size1.6 KiB
MetadataNo Metadata
Table Rows Size Has Metadata
Edges 6 172 Bytes
Individuals 2 44 Bytes
Migrations 0 4 Bytes
Mutations 0 8 Bytes
Nodes 7 172 Bytes
Populations 1 216 Bytes
Provenances 1 958 Bytes
Sites 0 8 Bytes

Here we specify two sample individuals, which at the default ploidy of two, gives us four sample nodes.

Warning

It is important to note that the number of samples refers to the number of individuals not the number of nodes (monoploid genomes). See the Ploidy section for details.

This integer form for the samples argument can only be used when the demography is a single population model.

Populations

The next example illustrates one usage of the dictionary form of the samples argument, in which the keys refer to populations in the demographic model and the values are the number of samples to draw. (See the Demography section for more details on the demography argument to sim_ancestry().) We first create a Demography object representing a 10 deme linear stepping stone model. Then, we run the simulation with 3 diploid samples each drawn from the first and last demes in this linear habitat.

N = 10
demography = msprime.Demography.stepping_stone_model(
    [100] * N,
    migration_rate=0.1,
    boundaries=True)
ts = msprime.sim_ancestry({0: 3, N - 1: 3}, demography=demography)
ts
Tree Sequence
Trees1
Sequence Length1.0
Sample Nodes12
Total Size5.2 KiB
MetadataNo Metadata
Table Rows Size Has Metadata
Edges 22 620 Bytes
Individuals 6 108 Bytes
Migrations 0 4 Bytes
Mutations 0 8 Bytes
Nodes 23 556 Bytes
Populations 10 549 Bytes
Provenances 1 3.2 KiB
Sites 0 8 Bytes

The keys in the dictionary can also be the string names of the population (see the Identifying populations section ), which is useful when we are simulating from empirically estimated models. For example, here create a Demography object based on a species tree, and then draw samples using the species names.

demography = msprime.Demography.from_species_tree(
    "(((human:5.6,chimpanzee:5.6):3.0,gorilla:8.6):9.4,orangutan:18.0)",
    time_units="myr",
    initial_size=10**4,
    generation_time=20)
ts = msprime.sim_ancestry({"gorilla": 2, "human": 4}, demography=demography)
ts
Tree Sequence
Trees1
Sequence Length1.0
Sample Nodes12
Total Size4.6 KiB
MetadataNo Metadata
Table Rows Size Has Metadata
Edges 22 620 Bytes
Individuals 6 108 Bytes
Migrations 0 4 Bytes
Mutations 0 8 Bytes
Nodes 23 556 Bytes
Populations 7 449 Bytes
Provenances 1 2.7 KiB
Sites 0 8 Bytes

When samples are drawn using this dictionary form it is always at the population’s Default sampling time. If you wish to draw samples at a different time, then you must use the more general form of a list of SampleSet objects.

See also

See the Demography section for more information on how to specify demographic models in a simulation.

Sampling time

By default the samples that we draw from a Population are at the population’s default_sampling_time. See the Default sampling time section for more information about how this is set for populations in a demographic model.

We can manually control the time at which samples are drawn using list of SampleSet objects form for the samples argument.

In this example we create two diploid sample individuals, one at the present time and one taken 50 generations ago, representing one modern and one ancient individual. By default, when we draw the trees, the nodes that belong to samples are drawn as squares whereas nodes that are not samples are drawn as circles.

ts = msprime.sim_ancestry(
    samples = [
        msprime.SampleSet(1),
        msprime.SampleSet(1, time=50)
    ],
    population_size=100,
    random_seed=42
)
SVG(ts.draw_svg(y_axis=True))
_images/ancestry_10_0.svg

See also

See the Order of event execution section for details of the relative order in which sampling events occur.

Sample details

Sample individuals and nodes are allocated sequentially in the order that they are specified. For example:

demography = msprime.Demography.island_model([10, 10], migration_rate=1)
ts = msprime.sim_ancestry(
    samples=[
        msprime.SampleSet(1, population=1),
        msprime.SampleSet(2, population=0)],
    demography=demography,
    end_time=0)
print(ts.tables.individuals)
print(ts.tables.nodes)
╔══╤═════╤════════╤═══════╤════════╗
║id│flags│location│parents│metadata║
╠══╪═════╪════════╪═══════╪════════╣
║0 │    0│        │       │     b''║
║1 │    0│        │       │     b''║
║2 │    0│        │       │     b''║
╚══╧═════╧════════╧═══════╧════════╝

╔══╤═════╤══════════╤══════════╤══════════╤════════╗
║id│flags│population│individual│time      │metadata║
╠══╪═════╪══════════╪══════════╪══════════╪════════╣
║0 │    1│         1│         0│0.00000000│     b''║
║1 │    1│         1│         0│0.00000000│     b''║
║2 │    1│         0│         1│0.00000000│     b''║
║3 │    1│         0│         1│0.00000000│     b''║
║4 │    1│         0│         2│0.00000000│     b''║
║5 │    1│         0│         2│0.00000000│     b''║
╚══╧═════╧══════════╧══════════╧══════════╧════════╝

(Because we’re only interested in the sampled nodes and individuals we stopped the simulation from actually doing anything by setting end_time=0 — see the Setting the start time section for more information.) Here we define three sample individuals, and we therefore have three rows in the individual table. Because these are diploid individuals, the node table contains six sample nodes. If we look at the individual column in the node table we can see that the first two nodes correspond to individual 0, the next two nodes individual 1, etc. The sample configuration stated that the first sample should come from population 1 and the other two from population 0, and we can see this reflected in the population column of the node table. (Somewhat confusingly, population values are associated with nodes rather than individuals; this is mostly for historical reasons.)

The SampleSet class has a number of attributes which default to None. If these are set they will override the values which might be specified elsewhere. For example, we can specify mixed ploidy samples via the ploidy attribute of SampleSet:

ts = msprime.sim_ancestry(
    samples=[
        msprime.SampleSet(1, ploidy=3),
        msprime.SampleSet(2)],
    ploidy=1,
    end_time=0)
print(ts.tables.individuals)
print(ts.tables.nodes)
╔══╤═════╤════════╤═══════╤════════╗
║id│flags│location│parents│metadata║
╠══╪═════╪════════╪═══════╪════════╣
║0 │    0│        │       │     b''║
║1 │    0│        │       │     b''║
║2 │    0│        │       │     b''║
╚══╧═════╧════════╧═══════╧════════╝

╔══╤═════╤══════════╤══════════╤══════════╤════════╗
║id│flags│population│individual│time      │metadata║
╠══╪═════╪══════════╪══════════╪══════════╪════════╣
║0 │    1│         0│         0│0.00000000│     b''║
║1 │    1│         0│         0│0.00000000│     b''║
║2 │    1│         0│         0│0.00000000│     b''║
║3 │    1│         0│         1│0.00000000│     b''║
║4 │    1│         0│         2│0.00000000│     b''║
╚══╧═════╧══════════╧══════════╧══════════╧════════╝

(Again, we stop the simulation immediately because we’re only interested in the initial samples.) Here we have three sampled individuals again but in this they are mixed ploidy: the first individual is triploid and the other two are haploid.

Warning

It is vital to note that setting the ploidy value of SampleSet objects only affects sampling and does not affect the actual simulation. In this case, the simulation will be run on the haploid time scale. See the Ploidy section for more details.

If you wish to set up the node and individual IDs in some other way, it is possible to do so by using the initial_state parameter. See the Specifying the initial state for more information on how to use this (advanced) feature.

Demography

A demographic model is a description of a set of populations, the migration rates between then and the events that modify the populations over time. Ancestry simulations can take one of these models as input via the demography parameter to sim_ancestry().

See also

Please see the Demographic models section for details on how to create and debug demographic models, and the Specifying samples section for help on how to define the locations and times of sampled individuals.

We don’t need to explicitly specify a Demography for every ancestry simulation, however. By default, we assume a single fixed-size population. For the default standard coalescent ancestry model we assume a population size of 1, so that time is measured in “coalescent units”. This is useful for theoretical work where we usually work in scaled coalescent time.

Important

By default msprime uses the diploid coalescent time scale; see the Ploidy section for more details.

Outside of theoretical analysis, working with scaled coalescent time can be confusing and it’s usually better to explicitly set the population size so that the output times are directly interpretable in generations. The population_size parameter to sim_ancestry() allows us to set the size of the single constant sized population in the default demography.

For example, here we run the same simulation twice, one with the default population size and one with a population size of 100:

ts = msprime.sim_ancestry(2, random_seed=1234)
SVG(ts.draw_svg(y_axis=True))
_images/ancestry_16_0.svg
ts = msprime.sim_ancestry(2, population_size=100, random_seed=1234)
SVG(ts.draw_svg(y_axis=True))
_images/ancestry_17_0.svg

We can see that the results are identical, except the node times have been scaled by the population size.

The Discrete Time Wright-Fisher model is different. Because it explicitly works with a population of N individuals we get an error if we don’t specify a population size:

msprime.sim_ancestry(2, model="dtwf", random_seed=1234)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/tmp/ipykernel_2837/711429931.py in <module>
----> 1 msprime.sim_ancestry(2, model="dtwf", random_seed=1234)

~/work/tskit-site/tskit-site/msprime/ancestry.py in sim_ancestry(samples, demography, sequence_length, discrete_genome, recombination_rate, gene_conversion_rate, gene_conversion_tract_length, population_size, ploidy, model, initial_state, start_time, end_time, record_migrations, record_full_arg, num_labels, random_seed, num_replicates, replicate_index, record_provenance)
   1140         frame = inspect.currentframe()
   1141         provenance_dict = _build_provenance("sim_ancestry", random_seed, frame)
-> 1142     sim = _parse_sim_ancestry(
   1143         samples=samples,
   1144         sequence_length=sequence_length,

~/work/tskit-site/tskit-site/msprime/ancestry.py in _parse_sim_ancestry(samples, sequence_length, recombination_rate, gene_conversion_rate, gene_conversion_tract_length, discrete_genome, population_size, demography, ploidy, model, initial_state, start_time, end_time, record_migrations, record_full_arg, num_labels, random_seed, init_for_debugger)
    893             # an error.
    894             if population_size is None:
--> 895                 raise ValueError(
    896                     "When using the DTWF model, the population size must be set "
    897                     "explicitly, either using the population_size or demography "

ValueError: When using the DTWF model, the population size must be set explicitly, either using the population_size or demography arguments.

Note

Because the DTWF models populations of individuals rather than using continuous time approximations, simulations with different population sizes will yield entirely different results; we are not simply rescaling time by using different population sizes. Furthermore, simulations with large population sizes will take relatively more CPU time and memory.


Ploidy

The ploidy argument to sim_ancestry() has two effects:

  1. It sets the default ploidy (i.e., number sample nodes) of sample individuals.

  2. For continuous time coalescent models, it sets the time scale.

Both of these are somewhat confusing, so let’s look at them one at a time.

Default number of nodes per sample individual

As discussed in the Sample details section, the number of nodes (i.e., monoploid genomes) that we allocate per sample individual is determined by the ploidy. Here, we specify 2 sample individuals with a ploidy of 3:

ts = msprime.sim_ancestry(samples=2, ploidy=3, end_time=0)
print(ts.tables.individuals)
print(ts.tables.nodes)
╔══╤═════╤════════╤═══════╤════════╗
║id│flags│location│parents│metadata║
╠══╪═════╪════════╪═══════╪════════╣
║0 │    0│        │       │     b''║
║1 │    0│        │       │     b''║
╚══╧═════╧════════╧═══════╧════════╝

╔══╤═════╤══════════╤══════════╤══════════╤════════╗
║id│flags│population│individual│time      │metadata║
╠══╪═════╪══════════╪══════════╪══════════╪════════╣
║0 │    1│         0│         0│0.00000000│     b''║
║1 │    1│         0│         0│0.00000000│     b''║
║2 │    1│         0│         0│0.00000000│     b''║
║3 │    1│         0│         1│0.00000000│     b''║
║4 │    1│         0│         1│0.00000000│     b''║
║5 │    1│         0│         1│0.00000000│     b''║
╚══╧═════╧══════════╧══════════╧══════════╧════════╝

Note

Because we’re only interested in the initial conditions here we set the end_time to zero: see the Setting the start time section for more details.

Our individual table has two rows (the number of samples) and node table has 6 rows: three for each of the sampled individuals.

Note that this is the default, and can be overridden by the SampleSet.ploidy value, as described in the Sample details section.

Coalescent time scales

The ploidy argument also affects the time scale on which coalescence occurs, since it affects the total number of genomes in the population (see also population size). For example, consider the following two simulations, which are identical except for different ploidy values (note we use the SampleSet object to make sure we allocate the same number of nodes, as discussed in the Default number of nodes per sample individual section):

ts = msprime.sim_ancestry(
    samples=[msprime.SampleSet(4, ploidy=1)],
    population_size=1,
    ploidy=1,
    random_seed=1234
)
SVG(ts.draw_svg(y_axis=True))
_images/ancestry_23_0.svg
ts = msprime.sim_ancestry(
    samples=[msprime.SampleSet(4, ploidy=1)],
    population_size=1,
    ploidy=2,
    random_seed=1234
)
SVG(ts.draw_svg(y_axis=True))
_images/ancestry_24_0.svg

The resulting tree sequences are identical, except that node time differ by a factor of 2, because in the underlying coalescent algorithm the total number of genomes in the population acts as a time scaling.

Todo

Add a discussion here here about the Lambda coalescents an how things are more complicated for them.

Important

The Discrete Time Wright-Fisher model is an explicitly diploid model and it is therefore an error to run it with any ploidy value other than 2.

Genome properties

Sequence length

There are a number of different ways to specify the length of the chromosome that we want to simulate. In the absence of recombination and gene conversion, we assume a genome of length 1:

ts = msprime.sim_ancestry(2, random_seed=1)
SVG(ts.draw_svg())
_images/ancestry_26_0.svg

The sequence_length parameter determines the length of the simulated sequence:

ts = msprime.sim_ancestry(2, sequence_length=100, random_seed=1)
SVG(ts.draw_svg())
_images/ancestry_28_0.svg

By default there is no recombination or gene conversion, and so we therefore still have only one tree defining the ancestry of the samples. The sequence length makes little difference if we are considering only the ancestry of a non-recombining sequence, however, this will be important if we are interested in adding mutations later.

Recombination

As we trace the ancestry of a sample of genomes backwards in time, recombination has the effect of creating segments of genome that have different (but highly correlated) genealogical trees. The sim_ancestry() function has a recombination_rate argument that allows us to flexibly specify both uniform and non-uniform rates of recombination along the genome.

The simplest way to run simulations with recombination is to use the recombination_rate and sequence_length parameters:

ts = msprime.sim_ancestry(
    3, recombination_rate=0.1, sequence_length=10, random_seed=2)
SVG(ts.draw_svg())
_images/ancestry_30_0.svg

Here, we we have four distinct trees along the genome, created by recombination events.

We can also provide a RateMap argument to recombination_rate:

rate_map = msprime.RateMap.uniform(sequence_length=10, rate=0.1)
other_ts = msprime.sim_ancestry(3, recombination_rate=rate_map, random_seed=2)
assert ts.equals(other_ts, ignore_provenance=True)

There are two things to note here:

  1. We didn’t need to specify a sequence_length argument to sim_ancestry(), as this information is encapsulated by the rate_map instance. (We can provide a value for sequence_length if we wish, but it must be equal to the sequence_length property of the RateMap).

  2. The result of the simulation for a Uniform rate maps rate map with the same rate and sequence_length is identical to the first simulation above.

We can simulate non-uniform recombination by defining our RateMap appropriately. For example, here we define a map in which the rate over the first half of the sequence is one tenth of the second half:

rate_map = msprime.RateMap(position=[0, 10, 20], rate=[0.01, 0.1])
ts = msprime.sim_ancestry(3, recombination_rate=rate_map, random_seed=2)
SVG(ts.draw_svg())
_images/ancestry_36_0.svg

Because the recombination rate in from position 10 to 20 is so much higher than the rate from 0 to 10, we can see that all the recombinations have fallen in the high recombination region.

See also

Gene conversion

Gene conversion events are defined by two parameters: the rate at which gene conversion events are initiated and the distribution of tract lengths. In the default case of discrete genome coordinates, tract lengths are drawn from a geometric distribution with mean gene_conversion_tract_length (which must be at least 1). Note that if we specify a tract length of 1, then all gene conversion tracts will have length exactly 1. In the following example one gene conversion event of length 1 has occurred:

ts = msprime.sim_ancestry(
    3, gene_conversion_rate=0.02, gene_conversion_tract_length=1,
    sequence_length=10, random_seed=3)
SVG(ts.draw_svg())
_images/ancestry_38_0.svg

Note

We can see the specific effect of gene conversion here in this simple example, because the trees at either side of the event are identical. However, this will not be true in general.

Continuous genomes can also be used. In this case the parameters define the rate at which gene conversion events are initiated per unit of sequence length and the mean of the exponentially distributed gene conversion tract lengths. The following example shows the same simulation as above but for a continuous genome:

ts = msprime.sim_ancestry(
    3, gene_conversion_rate=0.02, gene_conversion_tract_length=1,
    sequence_length=10, random_seed=3, discrete_genome=False)
SVG(ts.draw_svg())
_images/ancestry_42_0.svg

Recombination and gene conversion at constant rates can be simulated alongside. In the following example recombinations at site 60 and 97 have occurred in addition to a gene conversion event covering the tract from site 76 to site 80.

ts = msprime.sim_ancestry(
    3, sequence_length = 100, recombination_rate=0.003,
    gene_conversion_rate=0.002, gene_conversion_tract_length=5,
    random_seed=6)
SVG(ts.draw_svg())
_images/ancestry_45_0.svg

Variable recombination rates and constant gene conversion rates can also be combined.

Note

Variable rates of gene conversion breakpoint initiation are not currently supported, but are planned for a future release. If you are interested in this feature please let us know!

Discrete or continuous?

By default, we assume that the genome we are simulating is discrete so that genome coordinates are at integer positions:

ts = msprime.sim_ancestry(
    3, recombination_rate=0.1, sequence_length=10, random_seed=2)
SVG(ts.draw_svg())
_images/ancestry_48_0.svg

We can also simulate a continous genome by setting discrete_genome=False:

ts = msprime.sim_ancestry(
    3, recombination_rate=0.25, sequence_length=1, discrete_genome=False,
    random_seed=33)
SVG(ts.draw_svg())
_images/ancestry_51_0.svg

Here we see that the breakpoints along the genome occur at floating point positions. Simulating a continuous genome sequence can be useful for theoretical work, but we recommend using discrete coordinates for most purposes.

Note

With a discrete genome and a recombination rate of \(r\), the expected number of recombinations per unit time and per unit length is \(1-e^{-r}\). Therefore the total map length for a segment of length \(L\) and recombination rate \(r\) is \(L (1 - e^{-r})\) and so \(Lr (1 - r/2) < L (1 - e^{-r}) < Lr\).

Multiple chromosomes

Msprime does not directly support simulating multiple chromosomes simultaneously, but we can emulate it using a single linear genome split into chromosome segments. Multiple chromosomes are modelled by specifying a recombination map with single base-pair segments with recombination probability 1/2 separating adjacent chromosomes. The probability that a recombination breakpoint occurs over one unit of time at a given base pair that has recombination rate \(r\) is \(1-e^{-r}\). Thus, we set the recombination rate in the base pair segments separating chromosomes to \(\log(2)\). This ensures that the recombination probability each generation is 1/2.

Do I need to do this?

The main purpose of simulating multiple chromosomes together (instead of running independent simulations for each chromosome) is to capture correlations between trees on distinct chromosomes. In many scenarios, simulating multiple chromosomes may not be necessary, and independent simulations may be preferable as they will be much more efficient.

Nelson et al. 2020 describes a few cases where simulating multiple chromosomes using the DTWF is important for accurately modeling shared patterns of variation between samples. In large simulated cohorts, where many samples can be close relatives, we need the DTWF model and multiple chromosomes to obtain the correct distribution of the number and total length of shared IBD segments. We similarly require multiple chromosomes and the DTWF model to get the correct ancestry patterns for very recently admixed populations.

Additionally, Multiple merger coalescents result in positively correlated ancestries between unlinked chromosomes (see Birkner et al. 2013). This correlation does not break down in time and multiple chromosomes should not be simulated independently under the multiple merger models. Unlinked chromosomes should scatter into distinct ancestors instantaneously under multiple merger coalescents, not with probability 1/2 per generation as in discrete population models. In msprime, this waiting time of length zero is approximated by an exponential distribution with rate \(r\), i.e. the recombination rate at the base pair separating the chromosomes, which should be set to a high enough value to obtain an acceptable approximation. We recommend considering in particular the relative error in comparison to the magnitudes other waiting times which are likely to arise in the simulation.

Important

Simulations of multiple chromosomes under either DTWF or the multiple merger models should use a discrete genome. While it is possible to set up such a simulation using a continuous genome, we strongly recommend against it, as chromosome divisions cannot be defined as discrete break points and the simulation will be very inefficient.

Example

We first set up the chromosome coordinates, recombination rates, and the corresponding recombination map. The following defines a recombination map for three chromosomes each 1 cM in length:

import math

r_chrom = 1e-8
r_break = math.log(2)
chrom_positions = [0, 1e6, 2e6, 3e6]
map_positions = [
    chrom_positions[0],
    chrom_positions[1],
    chrom_positions[1] + 1,
    chrom_positions[2],
    chrom_positions[2] + 1,
    chrom_positions[3]
]
rates = [r_chrom, r_break, r_chrom, r_break, r_chrom]
rate_map = msprime.RateMap(position=map_positions, rate=rates)

We cannot use the default StandardCoalescent model to run simulations of multiple chromosomes; doing so would be equivalent to running independent replicate simulations and much less efficient. To get the correct correlation between chromosomes, we must use a model like discrete-time Wright-Fisher. However, because correlations between chromosomes break down very rapidly, we only need to simulate using dtwf for roughly 10-20 generations, after which we can switch to the hudson model to finish the simulation more efficiently (see the Specifying ancestry models section for more information on switching between ancestry models).

In this example, we simulate 10 sampled diploid individuals using the recombination map defined above for three chromosomes and switching from the dtwf to hudson model 20 generations ago:

ts = msprime.sim_ancestry(
    10,
    population_size=1000,
    recombination_rate=rate_map,
    model=[
        msprime.DiscreteTimeWrightFisher(duration=20),
        msprime.StandardCoalescent(),
    ],
    random_seed=1234,
)
ts
Tree Sequence
Trees351
Sequence Length3000000.0
Sample Nodes20
Total Size53.8 KiB
MetadataNo Metadata
Table Rows Size Has Metadata
Edges 1254 34.3 KiB
Individuals 10 172 Bytes
Migrations 0 4 Bytes
Mutations 0 8 Bytes
Nodes 342 8.0 KiB
Populations 1 216 Bytes
Provenances 1 1.3 KiB
Sites 0 8 Bytes

To place each chromosome in its own separate tree sequence, we can trim the full tree sequence for each chromosome by defining the end points of each chromosome and using the keep_intervals() method. It is important to specify simplify=False so that node indices remain consistent across chromosomes. The trim() method is then used to trim the tree sequences to the focal chromosome, so that position indices begin at 0 within each chromosome tree sequence.

ts_chroms = []
for j in range(len(chrom_positions) - 1):
    start, end = chrom_positions[j: j + 2]
    chrom_ts = ts.keep_intervals([[start, end]], simplify=False).trim()
    ts_chroms.append(chrom_ts)
    print(chrom_ts.sequence_length)

# the third chromosome
chrom_ts
1000000.0
1000000.0
1000000.0
Tree Sequence
Trees102
Sequence Length1000000.0
Sample Nodes20
Total Size23.7 KiB
MetadataNo Metadata
Table Rows Size Has Metadata
Edges 371 10.1 KiB
Individuals 10 172 Bytes
Migrations 0 4 Bytes
Mutations 0 8 Bytes
Nodes 342 8.0 KiB
Populations 1 216 Bytes
Provenances 3 2.2 KiB
Sites 0 8 Bytes

This gives us a list of tree sequences, one for each chromosome, in the order that they were stitched together in the initial recombination map.

Recording more information

By default msprime stores the minimum amount of information required to represent the simulated genealogical history of the samples. Sometimes we are interested in more detailed information; this section gives details of options that allow us to do this.

Ancestral recombination graph

In msprime we usually want to simulate the coalescent with recombination and represent the output as efficiently as possible. As a result, we don’t store individual recombination events, but rather their effects on the output tree sequence. We also do not explicitly store common ancestor events that do not result in marginal coalescences. For some purposes, however, we want to get information on the full history of the simulation, not just the minimal representation of its outcome. The record_full_arg option to sim_ancestry() provides this functionality, as illustrated in the following example:

ts = msprime.sim_ancestry(
    3, recombination_rate=0.1, sequence_length=2,
    record_full_arg=True, random_seed=42)
print(ts.tables.nodes)
SVG(ts.draw_svg())
╔══╤══════╤══════════╤══════════╤══════════╤════════╗
║id│flags │population│individual│time      │metadata║
╠══╪══════╪══════════╪══════════╪══════════╪════════╣
║0 │     1│         0│         0│0.00000000│     b''║
║1 │     1│         0│         0│0.00000000│     b''║
║2 │     1│         0│         1│0.00000000│     b''║
║3 │     1│         0│         1│0.00000000│     b''║
║4 │     1│         0│         2│0.00000000│     b''║
║5 │     1│         0│         2│0.00000000│     b''║
║6 │     0│         0│        -1│0.21230674│     b''║
║7 │     0│         0│        -1│0.51485157│     b''║
║8 │     0│         0│        -1│0.71161449│     b''║
║9 │131072│         0│        -1│0.91107705│     b''║
║10│131072│         0│        -1│0.91107705│     b''║
║11│262144│         0│        -1│1.21743776│     b''║
║12│     0│         0│        -1│1.23130397│     b''║
║13│     0│         0│        -1│4.80416300│     b''║
╚══╧══════╧══════════╧══════════╧══════════╧════════╝
_images/ancestry_60_1.svg

After running the simulation we first print out the node table, which contains information on all the nodes in the tree sequence. Note that flags column contains several different values: all of the sample nodes (at time 0) have a flag value of 1 (tskit.NODE_IS_SAMPLE). Most other nodes have a flag value of 0, which is the standard for internal nodes in a coalescent simulations.

Nodes 9 and 10 have flags equal to 131072 (NODE_IS_RE_EVENT), which tells us that they correspond to a recombination event in the ARG. A recombination event results in two extra nodes being recorded, one identifying the individual providing the genetic material to the left of the breakpoint and the other identifying the individuals providing the genetic material to the right. The effect of this extra node can be seen in the trees: node 9 is present as a ‘unary’ node in the left hand tree and node 10 in the right.

In this particular case, the first thing that happens to these two lineages as we go up the tree sequence is that they coalesce straight back together again at node 11, forming a (normally undetectable) “diamond event” in the ARG, and explaining why the topology of both these trees appears the same. In a full ARG, this is one of many ways that a coalescent event can occur without resulting in a marginal coalescence, and which instead results in an additional unary node in the trees. Such nodes are given a flags value of 262144 (NODE_IS_CA_EVENT).

If we wish to reduce these trees down to the minimal representation, we can use tskit.TreeSequence.simplify(). The resulting tree sequence will have all of these unary nodes removed and will be equivalent to (but not identical, due to stochastic effects) calling sim_ancestry() without the record_full_arg argument.

Migrations nodes are also recording in the ARG using the NODE_IS_MIG_EVENT flag. See the Node flags section for more details.

Migration events

When running simulations with demography, by default, migration events are not recorded in the resulting tree sequence. While the population associated with each node is determined by the simulated migration events, knowledge of the precise sequence of migrations is useful in a number of situations: for example see the tskit tutorial on introgression.

The record_migrations parameter allows us to record these events in the simulated tree sequence outputted by sim_ancestry(). Briefly, a migration record encodes the node associated with the migration, the time of the event, the source and destination populations (backwards in time), and the left and right coordinates of the migrating segement. For more details on migration records, see the migration table definition.

Here, we provide a simple example of the effect of setting record_migrations=True using the stepping stone model where migration is permitted between adjacent populations. Additionally, we set record_full_arg=True (see previous section) to record nodes corresponding to migration events. This is not necessary, but will be helpful to visualise the result.

N = 10
demography = msprime.Demography.stepping_stone_model(
    [20] * N, migration_rate=0.1, boundaries=True
)
ts = msprime.sim_ancestry(
    {0: 1, N - 1: 1},
    demography=demography,
    record_migrations=True,
    record_full_arg=True,
    random_seed=6,
)

In the following visualisation, we trace the migration history of two samples backwards in time to their most recent common ancestor. Each one of the ten populations in the model is assigned a colour and a position on the x axis. The time of migration events is recorded on the y axis.

import tskit
import numpy as np
import matplotlib

cmap = matplotlib.cm.get_cmap('viridis')
tree = ts.first()
mig_color_dict = {}
for u, linestyle in zip(ts.samples()[::2], ["dotted", "dashdot"]):
    mig = ts.tables.migrations
    loc = []
    time = []
    while u != tskit.NULL:
        migs = np.where(mig.node == u)[0]
        for cur_mig in migs:
            cur_mig = mig[cur_mig]
            loc.append(cur_mig.dest + 0.5)
            time.append(cur_mig.time)
        u = tree.parent(u)
    plt.plot(loc, time, linestyle=linestyle, color="black")
for i in range(N):
    rgba = cmap(i / N)
    mig_color_dict[i] = np.array(rgba)*255
    plt.axvspan(i, i+1, facecolor=rgba, alpha=0.5)
plt.ylabel("Time of migration event")
plt.xlabel("Population")
plt.show()
_images/ancestry_66_0.svg

Note also how many migration events occurred before the two samples coalesced, even in this small example.

The next visualisation shows where each of these migrations fall on the edges of the simulated tree. We show the nodes associated with each migration event, colouring nodes and edges by the population with which they are associated (using the same colour scheme as the previous visualisation).

mig_color_dict = {}
cmap = matplotlib.cm.get_cmap("viridis")
for i in range(N):
    rgba = cmap(i / N)
    mig_color_dict[i] = np.array(rgba) * 255

edge_colors = {}
string = ".tree .edge {stroke-width:4px}"
pop = ts.tables.nodes.population
for node in ts.nodes():
    node = node.id
    edge_colors[node] = mig_color_dict[pop[node]]
    string += (
        ".node.n"
        + str(node)
        + "> .sym {fill: rgb"
        + str(tuple(mig_color_dict[pop[node]][0:3]))
        + "}"
    )
    string += (
        ".node.n"
        + str(node)
        + " > .edge {stroke: rgb"
        + str(tuple(mig_color_dict[pop[node]][0:3]))
        + "}"
    )
for mig in ts.migrations():
    edge_colors[mig.node] = mig_color_dict[mig.source]
    string += (
        ".node.n"
        + str(mig.node)
        + "> .sym {fill: rgb"
        + str(tuple(mig_color_dict[pop[mig.node]][0:3]))
        + "}"
    )
    string += (
        ".node.n"
        + str(mig.node)
        + " > .edge {stroke: rgb"
        + str(tuple(mig_color_dict[mig.dest][0:3]))
        + "}"
    )
node_labels = {node.id: "" for node in ts.nodes()}
SVG(
    ts.first().draw_svg(
        size=(500, 400), style=string, node_labels=node_labels, symbol_size=10
    )
)
_images/ancestry_68_0.svg

Census events

Census events allow you to add a node to each branch of the tree sequence at a given time during the simulation. This can be useful when you wish to study haplotypes that are ancestral to your simulated sample, or when you wish to know which lineages were present in which populations at specified times.

For instance, the following example creates a two population island model Demography, with the populations exchanging migrants at rate 0.05. At generation 5000, we add a census event using the Demography.add_census() method to determine where each of the lineages is at that time:

demography = msprime.Demography.island_model([1000, 1000], migration_rate=0.05)
demography.add_census(time=5000)
ts = msprime.sim_ancestry(
    {0: 1, 1: 1},
    demography=demography,
    recombination_rate=1e-7,
    sequence_length=1000,
    random_seed=141
)
SVG(ts.draw_svg())
_images/ancestry_70_0.svg

The resulting tree sequence has nodes on each tree at the specified census time. These are the nodes with IDs 8, 9, 10, 11, 12 and 13.

This tells us that the genetic material ancestral to the present day sample was held within 5 haplotypes at time 5000. The node table shows us that four of these haplotypes (nodes 8, 9, 10 and 11) were in population 0 at this time, and two of these haplotypes (nodes 12 and 13) were in population 1 at this time.

print(ts.tables.nodes)
╔══╤═══════╤══════════╤══════════╤═════════════╤════════╗
║id│flags  │population│individual│time         │metadata║
╠══╪═══════╪══════════╪══════════╪═════════════╪════════╣
║0 │      1│         0│         0│   0.00000000│     b''║
║1 │      1│         0│         0│   0.00000000│     b''║
║2 │      1│         1│         1│   0.00000000│     b''║
║3 │      1│         1│         1│   0.00000000│     b''║
║4 │      0│         1│        -1│2331.91109774│     b''║
║5 │      0│         1│        -1│3741.02811877│     b''║
║6 │      0│         0│        -1│4216.80416680│     b''║
║7 │      0│         1│        -1│4580.66488183│     b''║
║8 │1048576│         0│        -1│5000.00000000│     b''║
║9 │1048576│         0│        -1│5000.00000000│     b''║
║10│1048576│         0│        -1│5000.00000000│     b''║
║11│1048576│         0│        -1│5000.00000000│     b''║
║12│1048576│         1│        -1│5000.00000000│     b''║
║13│1048576│         1│        -1│5000.00000000│     b''║
║14│      0│         1│        -1│5232.65267392│     b''║
║15│      0│         0│        -1│8192.48105714│     b''║
╚══╧═══════╧══════════╧══════════╧═════════════╧════════╝

If we wish to study these ancestral haplotypes further, we can simplify the tree sequence with respect to the census nodes and perform subsequent analyses on this simplified tree sequence. In this example, ts_anc is a tree sequence obtained from the original tree sequence ts by labelling the census nodes as samples and removing all nodes and edges that are not ancestral to these census nodes.

nodes = [i.id for i in ts.nodes() if i.flags==msprime.NODE_IS_CEN_EVENT]
ts_anc = ts.simplify(samples=nodes)

Using the DTWF model

Census events assume that they are being used as part of a continuous time simulation process, in which multiple events are guaranteed not to happen at the same time. However, the DiscreteTimeWrightFisher model does not meet these assumptions, and allows multiple events (such as coalescences, migrations etc) to all occur at the same time. This includes census events, which can lead to errors when used in conjunction with the DTWF model.

Consider the following example:

demography = msprime.Demography()
demography.add_population(name="A", initial_size=10)
ts = msprime.sim_ancestry({"A": 3}, demography=demography, model="dtwf", random_seed=33)
SVG(ts.draw_svg(y_axis=True, time_scale="log_time"))
_images/ancestry_76_0.svg

We can see that a coalescence happens at time 3, giving node 7. Now, let’s try to perform a census at time 3:

demography = msprime.Demography()
demography.add_population(name="A", initial_size=10)
demography.add_census(time=3)
ts = msprime.sim_ancestry({"A": 3}, demography=demography, model="dtwf", random_seed=33)
---------------------------------------------------------------------------
LibraryError                              Traceback (most recent call last)
/tmp/ipykernel_2837/639428396.py in <module>
      2 demography.add_population(name="A", initial_size=10)
      3 demography.add_census(time=3)
----> 4 ts = msprime.sim_ancestry({"A": 3}, demography=demography, model="dtwf", random_seed=33)

~/work/tskit-site/tskit-site/msprime/ancestry.py in sim_ancestry(samples, demography, sequence_length, discrete_genome, recombination_rate, gene_conversion_rate, gene_conversion_tract_length, population_size, ploidy, model, initial_state, start_time, end_time, record_migrations, record_full_arg, num_labels, random_seed, num_replicates, replicate_index, record_provenance)
   1159         random_seed=random_seed,
   1160     )
-> 1161     return _wrap_replicates(
   1162         sim,
   1163         num_replicates=num_replicates,

~/work/tskit-site/tskit-site/msprime/ancestry.py in _wrap_replicates(simulator, num_replicates, replicate_index, provenance_dict, mutation_rate)
    666     )
    667     if replicate_index is not None:
--> 668         deque = collections.deque(iterator, maxlen=1)
    669         return deque.pop()
    670     else:

~/work/tskit-site/tskit-site/msprime/ancestry.py in run_replicates(self, num_replicates, mutation_rate, provenance_dict)
   1436         for replicate_index in range(num_replicates):
   1437             logger.info("Starting replicate %d", replicate_index)
-> 1438             self.run()
   1439             if mutation_rate is not None:
   1440                 # This is only called from simulate() or the ms interface,

~/work/tskit-site/tskit-site/msprime/ancestry.py in run(self, event_chunk, debug_func)
   1401                 raise ValueError("Model durations must be >= 0")
   1402             end_time = min(self.time + model_duration, self.end_time)
-> 1403             exit_reason = self._run_until(end_time, event_chunk, debug_func)
   1404             if exit_reason == ExitReason.COALESCENCE or self.time == self.end_time:
   1405                 logger.debug("Skipping remaining %d models", len(self.models) - j - 1)

~/work/tskit-site/tskit-site/msprime/ancestry.py in _run_until(self, end_time, event_chunk, debug_func)
   1366         ret = ExitReason.MAX_EVENTS
   1367         while ret == ExitReason.MAX_EVENTS:
-> 1368             ret = ExitReason(super().run(end_time, event_chunk))
   1369             if self.time > end_time:
   1370                 # Currently the Pedigree and Sweeps models are "non-reentrant"

LibraryError: The simulation model supplied resulted in a parent node having a time value <= to its child. This can occur either as a result of multiple bottlenecks happening at the same time or because of numerical imprecision with very small population sizes.

We got an error from msprime about a parent node having a time value <= to its child, which occured because we tried to add set of census nodes at time 3, but node 7 was already at time 3, giving a zero branch length (which is disallowed by tskit).

The solution is to use non-integer time values when performing a census under the DTWF:

demography = msprime.Demography()
demography.add_population(name="A", initial_size=10)
demography.add_census(time=3.5)
ts = msprime.sim_ancestry({"A": 3}, demography=demography, model="dtwf", random_seed=33)
SVG(ts.draw_svg(y_axis=True, time_scale="log_time"))
_images/ancestry_80_0.svg

Manipulating simulation time

Stopping simulations early

In most simulations we want to simulate a complete history for our samples: that is, to when we have a common ancestor at every point along the sequence. For example, here we generate a complete ancestral history for two diploid samples:

ts = msprime.sim_ancestry(2, sequence_length=10, recombination_rate=0.1, random_seed=42)
SVG(ts.draw_svg(y_axis=True, y_ticks=[0, 1, 2, 3, 4, 5]))
_images/ancestry_82_0.svg

Sometimes we would like to stop the simulation early, before complete coalescence has occured; perhaps we would like to combine several simulations with different properties, or we are really only interested in the recent past and simulating more ancient processes would be a waste of computational resources.

We can do this by specifying a end_time value. Let’s repeat the same simulation as above with an end_time of 2:

ts = msprime.sim_ancestry(
    2, sequence_length=10, recombination_rate=0.1, end_time=2, random_seed=42)
SVG(ts.draw_svg(y_axis=True, y_ticks=[0, 1, 2]))
_images/ancestry_84_0.svg

There are a number of important things to observe about this tree sequence:

  1. The history up until time 2 is identical to the original simulation.

  2. The first tree has not fully coalesced, and we therefore have multiple roots. Trees with multiple roots are fully supported in tskit.

  3. The nodes 8 and 9 are at time 2 (our end_time value), and these connect to the uncoalesced portions of the tree. These “unary” nodes are very important for ensuring the correct statistical properties when we combine multiple simulations.

See also

The ancestry model duration can also be used to stop simulations before coalescence.

For discrete time models, the exact interpretation of the end_time is important — we continue to simulate the process while the time is <= end_time. For example here we use an end_time of 1 generation in a Discrete Time Wright-Fisher simulation:

ts = msprime.sim_ancestry(3, population_size=10, model="dtwf", end_time=1, random_seed=42)
SVG(ts.draw_svg(y_axis=True))
_images/ancestry_86_0.svg

We have run 1 generation of the DTWF, resulting in the coalescence at node 9.

Setting the start time

Important

Setting simulation start time is an advanced feature which is rarely needed in practice.

Sometimes we need control when the simulation process starts. For example, given the initial configuration of at time zero we want to “jump ahead” to a later time point before we start generating random events. Consider the following example:

ts = msprime.sim_ancestry(2, random_seed=42)
ts.tables.nodes
idflagspopulationindividualtimemetadata
01000.00000000b''
11000.00000000b''
21010.00000000b''
31010.00000000b''
400-10.15642269b''
500-10.29152170b''
600-12.11740681b''

Now we run the same simulation with a start_time of 10:

ts = msprime.sim_ancestry(2, start_time=10, random_seed=42)
ts.tables.nodes
idflagspopulationindividualtimemetadata
01000.00000000b''
11000.00000000b''
21010.00000000b''
31010.00000000b''
400-110.15642269b''
500-110.29152170b''
600-112.11740681b''

We can see that the results are identical except that 10 has been added to all of the internal node times.

Important

It is important to note that the start_time only controls the “random” events that happen within the simulation model: events that are specified to occur at a given time (such as sampling or demographic events) will occur at that time, even if this is before the start_time.

Consider the following example in which we create a simple population tree demography, and then take samples at different times all before start_time:

demography = msprime.Demography()
demography.add_population(name="A", initial_size=100)
demography.add_population(name="B", initial_size=100)
demography.add_population(name="C", initial_size=100)
demography.add_population_split(time=5, derived=["A", "B"], ancestral="C")

ts = msprime.sim_ancestry(
    samples=[
        msprime.SampleSet(1, population="A", time=1),
        msprime.SampleSet(1, population="B", time=2),
        msprime.SampleSet(1, population="C", time=6),
    ],
    demography=demography,
    ploidy=1,
    start_time=10,
    random_seed=1)

ts.tables.nodes
idflagspopulationindividualtimemetadata
01001.00000000b''
11112.00000000b''
21226.00000000b''
302-152.47084127b''
402-166.18169933b''

We can see that the samples we specify are in the correct populations and their node times are at the sampling time. Note that all the coalescences happen in population ID 2 (“C”), indicating that the lineages from “A” and “B” moved there at the time of the split (which we can observe directly using the record_migrations option if we wish). Note also that we drew a sample from population “C” at time 6, which is after the split; if we tried to draw a sample before the split, we would get an error because the population would be inactive.

See also

See the Order of event execution section for details of the precise order in which fixed-time events occur.

Note

The same considerations apply when we are running simulations based on a given initial state; in this case, the start_time is by default set to the time of the youngest root in the input trees.

Order of event execution

There are two different classes of event that happen in ancestry simulations: “fixed” and “random” events. Random events are generated by the ancestry model, and fixed events are those that occur at a time specified by the user. Fixed events are demographic events or sampling events. If a number of different events are specified to occur at exactly the same time, then:

  1. All demographic events at this time are run, in the order they were specified.

  2. Sampling events occur immediately after all the demographic events have been executed. (See the Sampling time section for more details on ancient sampling events.)

These details are slightly different for the Discrete Time Wright-Fisher model. See the API documentation for details.

Specifying the initial state

By default msprime simulations are initialised by specifying a set of samples, which sets up the simulation with segments of ancestral material covering the whole sequence. Internally, this initial state is implemented by setting up a tskit.TableCollection which contains the population, node, and individual information. The low-level simulation code then reads these tables and sets up the simulation accordingly. We can see what this initial state looks like by setting up a simple simulation and stopping it immediately:

initial_ts = msprime.sim_ancestry(1, end_time=0)
initial_ts
Tree Sequence
Trees1
Sequence Length1.0
Sample Nodes2
Total Size1.3 KiB
MetadataNo Metadata
Table Rows Size Has Metadata
Edges 0 4 Bytes
Individuals 1 28 Bytes
Migrations 0 4 Bytes
Mutations 0 8 Bytes
Nodes 2 52 Bytes
Populations 1 216 Bytes
Provenances 1 961 Bytes
Sites 0 8 Bytes

We can see that the returned tree sequence has one population, two sample nodes and one individual (since sample individuals are diploid by default). There are no edges because the simulation didn’t run for any period of time: we’re essentially just returning the initial state used to start the simulation.

Then, running a simulation in which the initial_state is equal to initial_ts is precisely the same as running the simulation with the same arguments as we used above:

ts1 = msprime.sim_ancestry(initial_state=initial_ts, population_size=100, random_seed=2)
ts2 = msprime.sim_ancestry(1, population_size=100, random_seed=2)
assert ts1.equals(ts2, ignore_provenance=True)

Warning

Note that the initial_state tree sequence only contains the information about the ancestral histories, and not any of the simulation parameters. In the previous example we used default parameters to keep things simple, but in general the full simulation model will need to be specified as well as the initial state. In particular, see the Interaction with demography section.

While it is possible to specify the initial state manually by constructing the input tree sequence, it is an advanced topic and rarely necessary. More often we are interested in specifying the initial state of an msprime simulation using the output of another simulation.

Note

The initial_state tree sequence is quite literally the initial state of the tree sequence that is returned at the end of a simulation, and that any information present (including metadata) will not be altered.

Combining backward-time simulations

Suppose that we wished to simulate a scenario in which we have recombination happening at some rate in the recent past, and in the more distant past there is no recombination. We can do this by combining the simulations, using one as the initial_state for the other.

We first run the simulation of the recent past:

ts1 = msprime.sim_ancestry(
    2, recombination_rate=0.01, sequence_length=10, end_time=10,
    population_size=100, random_seed=1)
SVG(ts1.draw_svg(y_axis=True, time_scale="log_time", y_ticks=[0, 1, 2, 5, 10]))
_images/ancestry_98_0.svg

Some recombination and coalescence has happened and we have three marginal trees along the genome. None of these trees has fully coalesced, and so each of the trees has multiple roots.

Important

The set of nodes at time 10 connecting to all the internal nodes in the trees are very important: without these nodes and unary edges, the statistical properties of the combined simulations would not be correct! (See also section below on forward simulations.)

We then run the next phase of the simulation in which there is no recombination by passing ts1 as the argument to initial_state:

ts2 = msprime.sim_ancestry(initial_state=ts1, population_size=100, random_seed=3)
SVG(ts2.draw_svg(y_axis=True, time_scale="log_time", y_ticks=[0, 10, 20, 50, 100, 200, 500]))
_images/ancestry_100_0.svg

Since there is no recombination in this more ancient simulation, node 12 is the MRCA in all three trees. The unary nodes 6, 7, 8, and 9 marking the transition between the two simulation regimes are still present in the trees. If you wish to remove these, we can use the tskit.TreeSequence.simplify() method:

ts_simplified = ts2.simplify()
SVG(ts_simplified.draw_svg())
_images/ancestry_102_0.svg

Interaction with demography

In the previous example we saw how to combine two single population simulations. We can also combine more complex simulations involving demography, but care needs to be taken to ensure that the correct models are simulated in each part of the simulation. Suppose we have simple model with a population split:

demography = msprime.Demography()
demography.add_population(name="A", description="Contemporary population A", initial_size=100)
demography.add_population(name="B", description="Contemporary population B", initial_size=200)
demography.add_population(name="C", description="Ancestral population", initial_size=300)
demography.add_population_split(time=100, ancestral="C", derived=["A", "B"])
demography
Populations (3)
idnamedescriptioninitial_sizegrowth_ratedefault_sampling_timeextra_metadata
0AContemporary population A100.000{}
1BContemporary population B200.000{}
2CAncestral population300.001e+02{}
Migration matrix (all zero)
Events (1)
timetypeparameterseffect
100Population Splitderived=[A, B], ancestral=CMoves all lineages from derived populations 'A' and 'B' to the ancestral 'C' population. Also set the derived populations to inactive, and all migration rates to and from the derived populations to zero.

Then, we run a simulation of this model for 50 generations:

ts1 = msprime.sim_ancestry(
    samples={"A": 1, "B": 1}, demography=demography, end_time=50, random_seed=1234)
node_labels = {
    node.id: f"{node.id}:{ts1.population(node.population).metadata['name']}"
    for node in ts1.nodes()}
SVG(ts1.draw_svg(node_labels=node_labels, y_axis=True))
_images/ancestry_106_0.svg

After 50 generations we can see there has been a coalescence in population A but none in B and we therefore have three lineages left when the simulation finishes. We can then continue the simulation based on this initial state:

ts2 = msprime.sim_ancestry(
    initial_state=ts1, demography=demography, random_seed=5678)
node_labels = {
    node.id: f"{node.id}:{ts2.population(node.population).metadata['name']}"
    for node in ts2.nodes()}
SVG(ts2.draw_svg(node_labels=node_labels, y_axis=True))
_images/ancestry_108_0.svg

Note that we use the same demography object, and so the lineages migrate and ultimately coalesce in population C, as we’d expect. This is the simplest case, in which we already have the demography object which we can use to model the entire simulation from end-to-end.

In other cases (e.g. when working with simulations from a different program), we don’t have the Demography object at hand and we need to create one that is both compatible with the initial_state tree sequence and reflects the demographic model that we are interested in using. The best way to do this is to use the Demography.from_tree_sequence() method to first get a base demography that is based on the tree sequence population table and metadata:

demography = msprime.Demography.from_tree_sequence(ts1)
demography
Populations (3)
idnamedescriptioninitial_sizegrowth_ratedefault_sampling_timeextra_metadata
0AContemporary population A0.000{}
1BContemporary population B0.000{}
2CAncestral population0.000{}
Migration matrix (all zero)
Events (0)

We can see that the population names and descriptions have been recovered from the tree sequence metadata but no other information: the population sizes are all zero and we no longer have a population split event. If we try to run a simulation using this demography we get an error:

ts = msprime.sim_ancestry(initial_state=ts1, demography=demography)
---------------------------------------------------------------------------
InputError                                Traceback (most recent call last)
/tmp/ipykernel_2837/2085812628.py in <module>
----> 1 ts = msprime.sim_ancestry(initial_state=ts1, demography=demography)

~/work/tskit-site/tskit-site/msprime/ancestry.py in sim_ancestry(samples, demography, sequence_length, discrete_genome, recombination_rate, gene_conversion_rate, gene_conversion_tract_length, population_size, ploidy, model, initial_state, start_time, end_time, record_migrations, record_full_arg, num_labels, random_seed, num_replicates, replicate_index, record_provenance)
   1140         frame = inspect.currentframe()
   1141         provenance_dict = _build_provenance("sim_ancestry", random_seed, frame)
-> 1142     sim = _parse_sim_ancestry(
   1143         samples=samples,
   1144         sequence_length=sequence_length,

~/work/tskit-site/tskit-site/msprime/ancestry.py in _parse_sim_ancestry(samples, sequence_length, recombination_rate, gene_conversion_rate, gene_conversion_tract_length, discrete_genome, population_size, demography, ploidy, model, initial_state, start_time, end_time, record_migrations, record_full_arg, num_labels, random_seed, init_for_debugger)
    964     random_generator = _msprime.RandomGenerator(random_seed)
    965 
--> 966     return Simulator(
    967         tables=initial_state,
    968         recombination_map=recombination_map,

~/work/tskit-site/tskit-site/msprime/ancestry.py in __init__(self, tables, recombination_map, gene_conversion_map, gene_conversion_tract_length, discrete_genome, ploidy, demography, random_generator, models, store_migrations, store_full_arg, start_time, end_time, num_labels)
   1257 
   1258         start_time = -1 if start_time is None else start_time
-> 1259         super().__init__(
   1260             tables=ll_tables,
   1261             recombination_map=ll_recomb_map,

InputError: Input error in initialise: Attempt to sample lineage in a population with size=0

To recover the original model we must update the Population objects in place and add the population split event:

demography["A"].initial_size = 100
demography["B"].initial_size = 200
demography["C"].initial_size = 300
demography.add_population_split(time=100, ancestral="C", derived=["A", "B"])
demography
Populations (3)
idnamedescriptioninitial_sizegrowth_ratedefault_sampling_timeextra_metadata
0AContemporary population A100.000{}
1BContemporary population B200.000{}
2CAncestral population300.001e+02{}
Migration matrix (all zero)
Events (1)
timetypeparameterseffect
100Population Splitderived=[A, B], ancestral=CMoves all lineages from derived populations 'A' and 'B' to the ancestral 'C' population. Also set the derived populations to inactive, and all migration rates to and from the derived populations to zero.

Note

See the Demography documentation for details on how to access Population objects.

The demography that we obtain from Demography.from_tree_sequence() is really just a template, which we update and elaborate as we need. In particular, we are free to add more populations as needed.

Continuing forwards-time simulations (“recapitating”)

The most common use for the initial_state argument is to start the simulation from the output of a forwards-time simulation. Informally, we take an ‘unfinished’ tree sequence as a parameter to sim_ancestry(), initialise the simulation from the state of this tree sequence and then run the simulation until coalescence. The returned tree sequence is then the result of taking the input tree sequence and completing the trees using the coalescent.

This is useful for forwards-time simulators such as SLiM and fwdpy11 that can output tree sequences. By running forward-time simulation for a certain number of generations we obtain a tree sequence, but these trees may not have had sufficient time to reach a most recent common ancestor. By using the initial_state argument to sim_ancestry() we can combine the best of both forwards- and backwards-time simulators. The recent past can be simulated forwards in time and the ancient past by the coalescent. The coalescent simulation is initialised by the root segments of the input tree sequence, ensuring that the minimal amount of ancestral material possible is simulated.

Any tree sequence can be provided as input to this process, but there is a specific topological requirement that must be met for the simulations to be statistically correct. To ensure that ancestral segments are correctly associated within chromosomes when constructing the initial conditions for the coalescent simulation, forward-time simulators must retain the nodes corresponding to the initial generation. Furthermore, for every sample in the final generation (i.e. the extant population at the present time) there must be a path to one of the founder population nodes.

See also

See the recapitation tutorial from the pyslim documentation for a detailed explanation of this process when using SLiM.

Models

The ancestry model determines the model under which the ancestral history of the sample is generated. It is concerned with the the basic processes of how (for example) common ancestor and recombination events occur. For example, the default “standard” or Hudson coalescent is an efficient continuous time model, and the Discrete Time Wright-Fisher model allows for more detailed (if less efficient) simulations by making fewer mathematical approximations. We can combine multiple ancestry models in a simulation using a flexible notation.

See also

See the Quick reference for a summary of the available ancestry models

Important

The ancestry model is distinct from the demographic model, which is mostly independent of the chosen ancestry model.

Specifying ancestry models

The ancestry models used during a simulation are specified using the model parameter to sim_ancestry(). Each model is a subclass of the AncestryModel class (see the following subsections for the available models, and examples of their usage).

Single models

Most of the time we want to run a simulation under a single ancestry model. By default, we run simulations under the StandardCoalescent model. If we wish to run under a different model, we use the model argument to sim_ancestry(). For example, here we use the SMC model instead of the standard coalescent:

ts1 = msprime.sim_ancestry(
    10,
    sequence_length=10,
    recombination_rate=0.1,
    model=msprime.SmcApproxCoalescent(),
    random_seed=1234)

We specify the model as an instance of one of the ancestry model classes. For nonparametric models, there is also a useful shorthand of the model name:

ts2 = msprime.sim_ancestry(
    10,
    sequence_length=10,
    recombination_rate=0.1,
    model="smc",
    random_seed=1234)
assert ts1.equals(ts2, ignore_provenance=True)

This string form is useful for single model simulations; however, when working with Multiple models it is better to use the explicit class form so that the Model duration can be set.

Model duration

Each ancestry model instance has a duration associated with it. This is the maximum amount of time that this model can run for. Thus, if we wanted to run 10 generations of the DiscreteTimeWrightFisher model we could write:

ts = msprime.sim_ancestry(
    3,
    population_size=10,
    model=msprime.DiscreteTimeWrightFisher(duration=10),
    random_seed=1234)
SVG(ts.draw_svg(y_axis=True))
_images/ancestry_120_0.svg

Note

Using the duration value for a single model like this is identical to specifying an end_time value: see the Stopping simulations early section for more information.

It is vital to understand that the duration value here is not an absolute time at which the simulation must stop, but rather the number of generations that the model should run for. To illustrate this we can use the start_time parameter:

ts = msprime.sim_ancestry(
    3,
    population_size=10,
    start_time=5,
    model=msprime.DiscreteTimeWrightFisher(duration=10),
    random_seed=1234)
SVG(ts.draw_svg(y_axis=True))
_images/ancestry_122_0.svg

We can see that the simulation continued until 15 generations ago, because it didn’t start until 5 generations ago and we then specified that the model should run for at most 10 generations.

The model activity time is specified as a duration in this way as this is a natural way to specify Multiple models, in which some of the models may run for random periods of time.

Model completion

Most ancestry models will run until the overall simulation has completed; that is, when we have simulated back to the most recent common ancestor at every position along the sequence. For example, the StandardCoalescent and DiscreteTimeWrightFisher models will simulate until coalescence, unless we explicitly indicate that we wish to stop the simulation early via the model duration or the end_time parameter.

Models such as SweepGenicSelection are different though: they are only defined for a limited amount of time and may not run until coalescence (see the Selective sweeps for more information on the model itself).

sweep = msprime.SweepGenicSelection(
    position=5,
    start_frequency=0.1,
    end_frequency=0.9,
    s=0.25,
    dt=1e-6,
)
ts = msprime.sim_ancestry(
    3,
    sequence_length=10,
    population_size=100,
    model=sweep,
    random_seed=1234)
SVG(ts.draw_svg(y_axis=True))
_images/ancestry_124_0.svg

Here, we see that even though we didn’t specify a model duration or end_time parameter the simulation has ended before we reached coalescence. In addition, the model duration is random:

ts = msprime.sim_ancestry(
    3,
    sequence_length=10,
    population_size=100,
    model=sweep,
    random_seed=12345)
SVG(ts.draw_svg(y_axis=True))
_images/ancestry_126_0.svg

Here we ran the same simulation with a different random seed, and we can see that the time the model completed was different in the two cases (30.22 vs 28.15 generations ago).

Note

Models such as SweepGenicSelection can also complete due to full coalescence.

Multiple models

Sometimes we are interested in combining different models in a simulation. To do this we provide a list of ancestry model instances as the model argument to sim_ancestry(). The models are run in the order they are specified. For each model, if its duration is set, then the simulation will run for at most that additional time duration. If the duration is not set (or None), then the model will run until completion.

For example, here we run a hybrid DTWF and coalescent simulation (see the Discrete Time Wright-Fisher section for more details):

ts = msprime.sim_ancestry(
    2,
    population_size=1000,
    model=[
        msprime.DiscreteTimeWrightFisher(duration=500),
        msprime.StandardCoalescent(),
    ],
    random_seed=2
)

Warning

It is very imporant to remember to set a duration value in the first model here! Otherwise, the entire simulation will run under the DTWF model.

In this example we’ve run the DiscreteTimeWrightFisher model for the first 500 generations, and then switched to the StandardCoalescent. Because we did not specify a duration in the second model, it will run until coalescence.

Note

Switching models at a fixed time point like this is equivalent to first running the DTWF phase of the simulation with end_time=500 and then using the output as the initial state for the simulation of the Hudson phase. The model switching syntax here is a little more convenient and efficient, however.

See also

See the Multiple sweeps section for an example of running many models with random durations.

Tip

The logging output of msprime can be very useful when working with multiple models. See the Logging section for more details.

Hudson coalescent

The standard coalescent is the default model in msprime. The algorithm for simulating the coalescent with recombination was developed by Hudson (1983), and so we refer to it as the “Hudson” coalescent.

Running a simulation without specifying a model is the same as running with model="hudson":

ts1 = msprime.sim_ancestry(5, random_seed=2)
ts2 = msprime.sim_ancestry(5, model="hudson", random_seed=2)
# This is the same simulation so tree sequences are equal
assert ts1.equals(ts2, ignore_provenance=True)

Time is measured in units of generations ago. This is a continuous time model, in which the simulation moves event-by-event backwards in time (contrasting with the Discrete Time Wright-Fisher model, which works generation-by-generation). Thus, we can have fractional generations, as in this example where the MRCA of the sample occurs 10.59 “generations” ago:

ts = msprime.sim_ancestry(3, random_seed=234)
SVG(ts.draw_svg(y_axis=True))
_images/ancestry_132_0.svg

This occurs because time is scaled to be proportional to the population size; if we run the same simulation with a different population size, we can see that the time values are simply scaled up accordingly (the default population_size is 1):

ts = msprime.sim_ancestry(3, population_size=100, random_seed=234)
SVG(ts.draw_svg(y_axis=True))
_images/ancestry_134_0.svg

See also

The ploidy parameter also controls the time scale; the default here is the diploid time scale where time is measured in units of 4N generations.

SMC approximations

The SMC and SMC' are approximations of the continuous time Hudson coalescent model. These were originally motivated largely by the need to simulate coalescent processes more efficiently than was possible using the software available at the time; however, improved algorithms mean that such approximations are now mostly unnecessary for simulations.

The SMC and SMC’ are however very important for inference, as the approximations have made many analytical advances possible.

Since the SMC approximations are not required for simulation efficiency, these models are implemented using a naive rejection sampling approach in msprime. The implementation is intended to facilitate the study of the SMC approximations, rather than to be used in a general-purpose way.

Discrete Time Wright-Fisher

Msprime provides the option to perform discrete-time Wright-Fisher (DTWF) simulations for scenarios when the coalescent model is not appropriate, including large sample sizes, multiple chromosomes, or recent migration. Please see the API documentation for a formal description of the model, and Nelson et al. 2020 for more details and background information.

To run DTWF simulations, we need to provide the model="dtwf" argument to sim_ancestry():

ts = msprime.sim_ancestry(4, population_size=10, model="dtwf", random_seed=7)
SVG(ts.draw_svg(y_axis=True, tree_height_scale="log_time"))
/opt/hostedtoolcache/Python/3.8.12/x64/lib/python3.8/site-packages/tskit/drawing.py:834: FutureWarning: tree_height_scale is deprecated; use time_scale instead
  warnings.warn(
_images/ancestry_136_1.svg

There are a few important points to note here:

  1. All node times are integers (this is discrete time counted in generations);

  2. We can have non-binary nodes (see node 10);

  3. We can have simultaneous events (see nodes 8 and 9).

Most features can be used with the DTWF model, and an error will be raised if you attempt to use a feature that is not implemented for DTWF. (Please let us know if this feature is important to you!)

Important

For DTWF simulations with population structure, each row of the migration matrix must sum to one or less.

Because DTWF simulations trace the history of a sample by going backwards in time generation-by-generation rather than event-by-event like the continuous time models, it is significantly slower to simulate than the standard coalescent.

“Hybrid” simulations can be a good solution if DTWF simulations are slow. In this case we can use DTWF simulations for the recent past, and using the standard coalescent for the more distant past (where it is an excellent approximation of the discrete time model). Hybrid simulations of this type provide the best of computational efficiency and accuracy.

See also

See Nelson et al. 2020 for more details and examples where the DTWF model is more realistic than the coalescent, and an assessment of the accuracy of hybrid simulations.

For example, here we simulate with the DTWF model for 500 generations, before switching to the standard (Hudson) coalescent:

ts = msprime.sim_ancestry(
    2,
    population_size=1000,
    model=[
        msprime.DiscreteTimeWrightFisher(duration=500),
        msprime.StandardCoalescent(),
    ],
    random_seed=4
)
SVG(ts.draw_svg(y_axis=True, tree_height_scale="log_time"))
_images/ancestry_138_0.svg

Because of the integer node times, we can see here that most of the coalescent happened during the Wright-Fisher phase of the simulation, and as-of 500 generations in the past, there were only two lineages left. The continuous time standard coalescent model was then used to simulate the ancient past of these two lineages.

See the Specifying ancestry models section for more details on how the use the model argument to sim_ancestry() to run multiple models.

See also

See the Multiple chromosomes section for information on how to simulate multiple chromosomes using the DTWF.

Multiple merger coalescents

Some evolutionary scenarios, such as a skewed offspring distribution combined with a type III survivorship curve, range expansion, and rapid adaptation, can predict genealogies with up to four simultaneous multiple mergers. Msprime provides the option to simulate from two classes of such genealogical processes: the BetaCoalescent and the DiracCoalescent. Please see the API documentation for formal details of the models.

For haploid organisms, both models result in genealogies in which any number of lineages can merge into a common ancestor, but only one merger event can take place at a given time. For \(p\)-ploids, up to \(2p\) simultaneous mergers can take place, corresponding to the \(2p\) available parental chromosome copies.

The diploid Beta-Xi-coalescent can be simulated as follows:

ts = msprime.sim_ancestry(
    samples=5, ploidy=2, random_seed=1,
    model=msprime.BetaCoalescent(alpha=1.001))
SVG(ts.draw_svg())
_images/ancestry_140_0.svg

The specified value of \(\alpha = 1.001\) corresponds to a heavily skewed offspring distribution. Values closer to \(\alpha = 2\) result in trees whose distribution is closer to that of the standard coalescent, often featuring no multiple mergers for small sample sizes:

ts = msprime.sim_ancestry(
    samples=4, ploidy=2, random_seed=1,
    model=msprime.BetaCoalescent(alpha=1.8))
SVG(ts.draw_svg())
_images/ancestry_142_0.svg

Multiple mergers still take place in a haploid simulation, but only one merger can take place at a given time:

ts = msprime.sim_ancestry(
    samples=10, ploidy=1, random_seed=1,
    model=msprime.BetaCoalescent(alpha=1.001))
SVG(ts.draw_svg())
_images/ancestry_144_0.svg

A haploid simulation results in larger individual mergers than a polyploid simulation because large mergers typically get broken up into multiple simultaneous mergers in the polyploid model.

The number of generations between merger events in the Beta-coalescent depends nonlinearly on both \(\alpha\) and the population size \(N\) as detailed above. For a fixed \(\alpha\), the number of generations between common ancestor events is proportional to \(N^{\alpha - 1}\), albeit with a complicated constant of proportionality that depends on \(\alpha\). The dependence on \(\alpha\) for fixed \(N\) is not monotone. Thus, branch lengths and the number of generations until a most recent common ancestor depend on both of these parameters.

To illustrate, for \(\alpha\) close to 2 the relationship between effective population size and number of generations is almost linear:

ts = msprime.sim_ancestry(
    samples=1, ploidy=2, random_seed=1, population_size=10,
    model=msprime.BetaCoalescent(alpha=1.99))
tree = ts.first()
print(tree.tmrca(0,1))
ts = msprime.sim_ancestry(
    samples=1, ploidy=2, random_seed=1, population_size=1000,
    model=msprime.BetaCoalescent(alpha=1.99))
tree = ts.first()
print(tree.tmrca(0,1))
0.14959691919068155
14.286394871874865

For \(\alpha\) close to 1 the effective population size has little effect:

ts = msprime.sim_ancestry(
    samples=1, ploidy=2, random_seed=1, population_size=10,
    model=msprime.BetaCoalescent(alpha=1.1))
tree = ts.first()
print(tree.tmrca(0,1))
ts = msprime.sim_ancestry(
    samples=1, ploidy=2, random_seed=1, population_size=1000,
    model=msprime.BetaCoalescent(alpha=1.1))
tree = ts.first()
print(tree.tmrca(0,1))
16.311807036386615
25.85247192870844

The Dirac-coalescent is simulated similarly in both the diploid case:

ts = msprime.sim_ancestry(
    samples=5, ploidy=2, random_seed=1,
    model=msprime.DiracCoalescent(psi=0.9, c=10))
SVG(ts.draw_svg())
_images/ancestry_150_0.svg

and in the haploid case:

ts = msprime.sim_ancestry(
    samples=10, ploidy=1, random_seed=1,
    model=msprime.DiracCoalescent(psi=0.9, c=10))
SVG(ts.draw_svg())
_images/ancestry_152_0.svg

As with the Beta-coalescent, a haploid simulation results in larger individual mergers than a polyploid simulation because large mergers typically get broken up into multiple simultaneous mergers in the polyploid model. Larger values of the parameter \(c > 0\) result in more frequent multiple mergers, while larger values of \(0 < \psi \leq 1\) result in multiple mergers with more participating lineages. Setting either parameter to 0 would correspond to the standard coalescent.

The Dirac-coalescent is obtained as the infinite population scaling limit of Moran models, and therefore branch lengths are proportional to \(N^2\) generations, as opposed to \(N\) generations under the standard coalescent.

ts = msprime.sim_ancestry(
    samples=1, ploidy=2, random_seed=1, population_size=10,
    model=msprime.DiracCoalescent(psi=0.1, c=1))
tree = ts.first()
print(tree.tmrca(0,1))
ts = msprime.sim_ancestry(
    samples=1, ploidy=2, random_seed=1, population_size=100,
    model=msprime.DiracCoalescent(psi=0.1, c=1))
tree = ts.first()
print(tree.tmrca(0,1))
64.7216732891741
6472.167328917411

Selective sweeps

Although the coalescent assumes that lineages are exchangable, (and therefore evolving neutrally), approximations to some forms of selective sweep (beneficial mutation moves through the population), have been derived. This is done through the use of a structured coalescent model in the spirit of Braverman et al. (1995).

In the SweepGenicSelection model, the population is split between two classes of selective backgrounds, those linked to the beneficial allele, call it \(B\), and those not, \(b\). In this model we track competing rates of coalescence and recombination on the \(B\) and \(b\) backgrounds. The user supplies a final allele frequency and a starting allele frequency, between which msprime simulates a stochastic sweep trajectory according to a conditional diffusion model (Coop and Griffiths, 2004; Kim and Stephan 2002).

Beyond the start and end frequencies of the sweep trajectory, the user must specify the selection coefficient of the beneficial mutation \(s\) the selective advantage of the \(B\) homozygote over the \(b\) homozygote and \(h=0.5\). The position represents the location along the chromosome where the beneficial allele occurs. All other parameters can be set as usual.

Warning

The implementation of the structured coalescent is quite general, but there are some limitations in the current implementation of the sweeps model (e.g., no change of population size during a sweep). Please let us know if there are related features you would like to see (or if you would be interested in helping to create this functionality!)

Hard sweeps

Todo

What’s a hard sweep? Some intro sentences here.

In this example we run some replicate hard sweep simulations and plot the mean pairwise diversity in windows across the simulated region, showing the characteristic valley of diversity.

First we set up some basic parameters, and define the sweep model:

Ne = 1e3
L = 1e6  # Length of simulated region
num_reps = 100

# define hard sweep model
sweep_model = msprime.SweepGenicSelection(
    position=L / 2,  # middle of chrom
    start_frequency=1.0 / (2 * Ne),
    end_frequency=1.0 - (1.0 / (2 * Ne)),
    s=0.25,
    dt=1e-6,
)

Todo

Explain why these parameters give a hard sweep.

Next we set up the replicate simulations. As described in the Multiple models section, the model parameter is a list when we want to run multiple ancestry models. In this example, we have a sweep that occurs in the immediate past, and is then followed by the standard coalescent for the rest of time:

reps = msprime.sim_ancestry(
    5,
    model=[sweep_model, msprime.StandardCoalescent()],
    population_size=Ne,
    recombination_rate=1e-7,
    sequence_length=L,
    num_replicates=num_reps,
)

Note

Because the SweepGenicSelection model has a random duration we do not set this value in the model. Please see the Model completion section for more discussion on this important point.

Once we’ve set up the replicate simulations we can compute the windows for plotting, run the actual simulations (see the Running replicate simulations section for more details) and compute the pairwise diversity.

wins = np.linspace(0, L, 21)
mids = (wins[1:] + wins[:-1]) / 2
diversity = np.zeros((num_reps, mids.shape[0]))
for j, ts in enumerate(reps):
    diversity[j] = ts.diversity(windows=wins, mode="branch")

Finally, we can plot the observed mean diversity across the replicates and compare it to the neutral expectation:

from matplotlib import pyplot as plt

plt.plot(mids, diversity.mean(axis=0), label="Simulations")
plt.axhline(4 * Ne, linestyle=":", label=r'Neutral expectation')
plt.ylabel(r'Branch $\pi$');
plt.xlabel('Position (bp)')
plt.legend();
_images/ancestry_162_0.svg

Note

We use the “branch” measure of pairwise diversity which is defined in terms of the trees rather than nucleotide sequences. See Ralph et al. 2020 for more information.

As we can see, the selective sweep has reduced variation in the region most closely linked to the beneficial allele and then diversity increases with distance to each side.

Multiple sweeps

Todo

This example is very artificial; it would be much better to illustrate with a meaningful example from the literature. Pull requests welcome!

To illustrate the idea that we can simulate large numbers of different models, in this example we create a simulation in which we have sweeps happening at random points along the genome, which are separated by a random duration of the standard coalescent.

First we set up our models:

import random
random.seed(1234)
L = 100

models = []
for _ in range(100):
    models.append(msprime.StandardCoalescent(duration=random.uniform(0, 1)))
    models.append(
        msprime.SweepGenicSelection(
            position=random.randint(1, L - 1),
            start_frequency=0.4,
            end_frequency=0.5,
            s=0.01,
            dt=1e-6,
        )
    )
models.append(msprime.StandardCoalescent())
models[:5]
[StandardCoalescent(duration=0.9664535356921388),
 SweepGenicSelection(duration=None, position=57, start_frequency=0.4, end_frequency=0.5, s=0.01, dt=1e-06),
 StandardCoalescent(duration=0.11685051774599753),
 SweepGenicSelection(duration=None, position=12, start_frequency=0.4, end_frequency=0.5, s=0.01, dt=1e-06),
 StandardCoalescent(duration=0.9109759624491242)]

We’ve created a list of models that have random parameters using the standard Python tools. There are lots of sweeps (many more than we will actually run, in all probability), each at a random position on the genome and each separated by a random amount of time running the standard coalescent. Finally, we add an instance of the standard coalescent without a duration to guarantee that our simulation will always coalesce fully (See the Model completion section for more information.) Finally, we print out the first 5 models so we can inspect them.

Then, we can run our simulation as usual:

ts = msprime.sim_ancestry(
    10,
    population_size=100,
    sequence_length=L,
    recombination_rate=0.01,
    model=models,
    random_seed=6789
)
ts
Tree Sequence
Trees100
Sequence Length100.0
Sample Nodes20
Total Size95.0 KiB
MetadataNo Metadata
Table Rows Size Has Metadata
Edges 1626 44.5 KiB
Individuals 10 172 Bytes
Migrations 0 4 Bytes
Mutations 0 8 Bytes
Nodes 533 12.5 KiB
Populations 1 216 Bytes
Provenances 1 24.9 KiB
Sites 0 8 Bytes

Tip

The logging output of msprime can be very useful when working with multiple models. See the Logging section for more details.

Missing data

Todo

Give an example of doing a simulation with unknown rates on the flanks and show that the resulting tree sequence has missing data in these regions.

Common errors

Infinite waiting time

Todo

explain this, why it happens and give examples of when we don’t detect it. Mention the possible_lineage_locations method.

demography = msprime.Demography()
demography.add_population(name="A", initial_size=1)
demography.add_population(name="B", initial_size=1)
msprime.sim_ancestry(samples={"A": 1, "B": 1}, demography=demography)
---------------------------------------------------------------------------
LibraryError                              Traceback (most recent call last)
/tmp/ipykernel_2837/2514769940.py in <module>
      2 demography.add_population(name="A", initial_size=1)
      3 demography.add_population(name="B", initial_size=1)
----> 4 msprime.sim_ancestry(samples={"A": 1, "B": 1}, demography=demography)

~/work/tskit-site/tskit-site/msprime/ancestry.py in sim_ancestry(samples, demography, sequence_length, discrete_genome, recombination_rate, gene_conversion_rate, gene_conversion_tract_length, population_size, ploidy, model, initial_state, start_time, end_time, record_migrations, record_full_arg, num_labels, random_seed, num_replicates, replicate_index, record_provenance)
   1159         random_seed=random_seed,
   1160     )
-> 1161     return _wrap_replicates(
   1162         sim,
   1163         num_replicates=num_replicates,

~/work/tskit-site/tskit-site/msprime/ancestry.py in _wrap_replicates(simulator, num_replicates, replicate_index, provenance_dict, mutation_rate)
    666     )
    667     if replicate_index is not None:
--> 668         deque = collections.deque(iterator, maxlen=1)
    669         return deque.pop()
    670     else:

~/work/tskit-site/tskit-site/msprime/ancestry.py in run_replicates(self, num_replicates, mutation_rate, provenance_dict)
   1436         for replicate_index in range(num_replicates):
   1437             logger.info("Starting replicate %d", replicate_index)
-> 1438             self.run()
   1439             if mutation_rate is not None:
   1440                 # This is only called from simulate() or the ms interface,

~/work/tskit-site/tskit-site/msprime/ancestry.py in run(self, event_chunk, debug_func)
   1401                 raise ValueError("Model durations must be >= 0")
   1402             end_time = min(self.time + model_duration, self.end_time)
-> 1403             exit_reason = self._run_until(end_time, event_chunk, debug_func)
   1404             if exit_reason == ExitReason.COALESCENCE or self.time == self.end_time:
   1405                 logger.debug("Skipping remaining %d models", len(self.models) - j - 1)

~/work/tskit-site/tskit-site/msprime/ancestry.py in _run_until(self, end_time, event_chunk, debug_func)
   1366         ret = ExitReason.MAX_EVENTS
   1367         while ret == ExitReason.MAX_EVENTS:
-> 1368             ret = ExitReason(super().run(end_time, event_chunk))
   1369             if self.time > end_time:
   1370                 # Currently the Pedigree and Sweeps models are "non-reentrant"

LibraryError: Infinite waiting time until next simulation event.