Switching from other simulators#

See also

Msprime outputs results as a succinct tree sequence using the tskit library. Please see Getting started with tskit for information on how to use this data structure to compute statistics and (if necessary) export to other formats.

Notes for ms users#

Msprime began as an efficient reimplementation of the classical ms program, and therefore largely follows the same underlying models as ms. If you wish to use msprime as a direct replacement for ms, please use the mspms program. This aims to be 100% compatible with ms, and should be substantially faster when simulating large regions.

However, simulations using mspms will still be hampered by the text output of ms, which is very inefficient for large simulations. For larger simulations the tskit output produced by msprime’s Python API will be many times smaller and faster to process than text based formats.

The Python APIs for describing demographic models, running ancestry and mutation simulations are extensively documented. There are a few basic differences to ms that are worth keeping in mind:

  • Time is measured in generations rather than coalescent units.

  • Population sizes are defined as absolute values, not scaled relative to Ne.

  • All rates are absolute rates per generation (i.e., not scaled)

  • Recombination rates are per unit of sequence length (not total rates along the genome)

  • Gene conversion rates are absolute, and do not depend on the recombination rate.

  • Mutations are at finite discrete sites by default, following the Jukes-Cantor nucleotide model (but infinite sites 0/1 mutations at continuous locations can be produced, if required).

Notes for discoal, msms and macs users#

Different coalescent simulators may use different scaling factors for simulation parameters, making it confusing to users who would like to switch from simulators such as discoal (Kern, 2016), msms (Ewing, 2010), and macs (Chen, 2009) to msprime. Here, we compared a few parameters and arguments commonly used in these simulators to mitigate this pain. As msprime python API is more efficient and expressive than mspms interface, the following comparisons is done between the sim_ancestry() and sim_mutations() interface of msprime, and the command-line options of the other simulators.

To better compare them, let us define a few terms:

  • r: the recombination rate per generation per base pair.

  • u: the mutation rate per generation per base pair.

  • g: time in generations before present.

  • nsam: sample size or the number of chromosomes simulated.

  • N: the effective population size (of diploids).

  • L: the length of the simulated chromosome in base pairs.

  • s: the selection coefficient is s for genotype homozygous for the selected allele (AA), hs for (Aa). Currently, h is always 0.5 in msprime.

  • selpos: the position (in base pairs) of the site under selection.

Scaled parameters#

Msprime

Discoal

Msms

Macs

Sample size

nsam/2

nsam

nsam

nsam

Recombination rate

r

4Nr(L-1)

4Nr(L-1)

4Nr

mutation rate

u

4NuL

4NuL

4Nu

Selection coefficient

AA: s

AA: 2Ns

AA: 2Ns

-

Aa: 0.5s

Aa: Ns

Aa: 2Nhs

-

Selection position

selpos

selopos/L

selpos/L

-

Time

g

g/(4N)

g/(4N)

g/(2N)

Additional comments:

  • msprime can directly use nsam as sample size parameter if ploidy is specified to 1. See Ploidy section for detailed explanation about ploidy.

  • Time in macs is also calculated as g/(4N), see Browning, 2015.

Example 1#

In Example 1, we simulate 100 samples with a sample size of 300 (chromosomes), a chromosome length of 1000,000 bp, recombination and mutation rates of 1e-8 per generation per base pair, and an effective population size of 1000.

Msprime#

import msprime

ts = msprime.sim_ancestry(
    samples=150,  # Default ploidy is 2, so 300 sampled chromosomes
    population_size=1000,
    recombination_rate=1e-8,
    sequence_length=1e6,
    discrete_genome=False,
    model='Hudson',
)
mutated_ts = msprime.sim_mutations(
    ts,
    rate=1e-8,
    discrete_genome=False,
    model=msprime.BinaryMutationModel(),
)

Note

msprime separates ancestry and mutation simulations, although the two steps are usually run together in other simulators. See the Quickstart section for an introduction to simulating ancestry and mutations in msprime.

See also

We run a single replicate simulation here. Please see the Randomness and replication section for more information on how to run replicate simulations in msprime.

Discoal#

# discoal {nsam} {num_repeats} {L} -t {4*N*u*L} -r {4*N*r*(L-1)}

$ discoal 300 100 1000000 -t 40 -r 40

Msms#

# java -jar msms.jar {nsam} {num_repeats} \
#   -t {4*N*u*L} -r {4*N*r*(L-1)} {L}

$ java -Xmx1G -jar msms.jar 300 100 -t 40 -r 40 1000000

MaCS#

# macs {nsam} {L} -t {4*N*u} -r {4*N*r}

$ macs 300 1000000 -t 0.00004 -r 0.00004

Example 2#

In Example 2, we add a selective sweep at the midpoint of the chromosome (position 500,000) ended at 80 generations ago. The end frequency of the selected allele is 0.9; the selection coefficient is 0.2 for genotype AA and 0.1 for Aa. This type of simulation can be done in msprime, discoal and msms.

Msprime#

import msprime

N = 1000  # effective population size

model_list = [
    msprime.StandardCoalescent(duration=80),  # From generation 0 to 80
    msprime.SweepGenicSelection(              # From generation 80 to
        position=500000,  # selpos            # selection start time (random)
        start_frequency=1 / (2 * N),
        end_frequency=0.9,
        s=0.2,  # s for AA
        dt=1.0 / (40 * N)
    ),
    msprime.StandardCoalescent()  # From selection start time to coalescence
]

ts = msprime.sim_ancestry(
    samples=150,  # Default ploidy is 2, so 300 sampled chromosomes
    population_size=N,
    recombination_rate=1e-8,
    sequence_length=1e6,
    discrete_genome=False,
    model=model_list,
)
mutated_ts = msprime.sim_mutations(
    ts,
    rate=1e-8,
    discrete_genome=False,
    model=msprime.BinaryMutationModel(),
)

The SweepGenicSelection class provides an interface to define the selective sweep model. More examples can be found in the Selective sweeps section.

Note

msprime ancestry model parameters can be specified as a single model (Example 1), or a list of models (Example 2). Please see the Models section for more details.

See also

We run a single replicate simulation here. Please see the Randomness and replication section for more information on how to run replicate simulations in msprime.

Discoal#

# discoal {nsam} {num_repeats} {L} -t {4*N*u*L} -r {4*N*r*(L-1)} \
#   -ws {sel_end_time/(4N)} -a {s * 0.5} \
#   -x {selpos/L} -c {end_frequency} -N {N}

$ discoal 300 100 1000000 -t 40 -r 40 \
   -ws 0.02 -a 200 -x 0.5 -c 0.9 -N 1000

Msms#

# java -jar msms.jar {nsam} {num_repeats} -t {4*N*u*L} -r {4*N*r*(L-1)} {L} \
#   -SaA {2*N*(s*0.5)} -SAA {2*N*s} \
#   -SF {sel_end_time/(4N)} {end_frequency} \
#   -Sp {selpos/L} -N {N}

$ java -Xmx1G -jar msms.jar 300 100 -t 40 -r 40 1000000 \
  -SaA 200 -SAA 400 -SF 0.02 0.9 -Sp 0.5 -N 1000