Switching from other simulators#
See also
Msprime outputs results as a succinct tree sequence using the tskit library. Please see Getting started with tskit for information on how to use this data structure to compute statistics and (if necessary) export to other formats.
Notes for ms users#
Msprime began as an efficient reimplementation of the classical ms
program, and therefore largely follows the same underlying models
as ms
. If you wish to use msprime
as a direct replacement
for ms
, please use the mspms program. This
aims to be 100% compatible with ms
, and should be substantially
faster when simulating large regions.
However, simulations using mspms
will still be hampered by the
text output of ms
, which is very inefficient for large
simulations. For larger simulations the tskit
output produced by msprime
’s Python API will be many times
smaller and faster to process than text based formats.
The Python APIs for describing demographic models,
running ancestry and
mutation simulations are extensively documented.
There are a few basic differences to ms
that are worth keeping
in mind:
Time is measured in generations rather than coalescent units.
Population sizes are defined as absolute values, not scaled relative to Ne.
All rates are absolute rates per generation (i.e., not scaled)
Recombination rates are per unit of sequence length (not total rates along the genome)
Gene conversion rates are absolute, and do not depend on the recombination rate.
Mutations are at finite discrete sites by default, following the Jukes-Cantor nucleotide model (but infinite sites 0/1 mutations at continuous locations can be produced, if required).
Notes for discoal, msms and macs users#
Different coalescent simulators may use different scaling factors for
simulation parameters, making it confusing to users who would like to switch
from simulators such as discoal
(Kern,
2016),
msms
(Ewing,
2010),
and macs
(Chen, 2009)
to msprime
. Here, we compared a few parameters and arguments commonly
used in these simulators to mitigate this pain. As msprime
python API is
more efficient and expressive than mspms
interface, the following
comparisons is done between the sim_ancestry()
and
sim_mutations()
interface of msprime
, and the command-line options
of the other simulators.
To better compare them, let us define a few terms:
r
: the recombination rate per generation per base pair.u
: the mutation rate per generation per base pair.g
: time in generations before present.nsam
: sample size or the number of chromosomes simulated.N
: the effective population size (of diploids).L
: the length of the simulated chromosome in base pairs.s
: the selection coefficient iss
for genotype homozygous for the selected allele (AA
),hs
for (Aa
). Currently,h
is always 0.5 inmsprime
.selpos
: the position (in base pairs) of the site under selection.
Scaled parameters#
Msprime |
Discoal |
Msms |
Macs |
|
---|---|---|---|---|
Sample size |
nsam/2 |
nsam |
nsam |
nsam |
Recombination rate |
r |
4Nr(L-1) |
4Nr(L-1) |
4Nr |
mutation rate |
u |
4NuL |
4NuL |
4Nu |
Selection coefficient |
AA: s |
AA: 2Ns |
AA: 2Ns |
- |
Aa: 0.5s |
Aa: Ns |
Aa: 2Nhs |
- |
|
Selection position |
selpos |
selopos/L |
selpos/L |
- |
Time |
g |
g/(4N) |
g/(4N) |
g/(2N) |
Additional comments:
msprime
can directly usensam
as sample size parameter ifploidy
is specified to 1. See Ploidy section for detailed explanation aboutploidy
.Time in
macs
is also calculated as g/(4N), see Browning, 2015.
Example 1#
In Example 1, we simulate 100 samples with a sample size of 300 (chromosomes), a chromosome length of 1000,000 bp, recombination and mutation rates of 1e-8 per generation per base pair, and an effective population size of 1000.
Msprime#
import msprime
ts = msprime.sim_ancestry(
samples=150, # Default ploidy is 2, so 300 sampled chromosomes
population_size=1000,
recombination_rate=1e-8,
sequence_length=1e6,
discrete_genome=False,
model='Hudson',
)
mutated_ts = msprime.sim_mutations(
ts,
rate=1e-8,
discrete_genome=False,
model=msprime.BinaryMutationModel(),
)
Note
msprime
separates ancestry and mutation
simulations, although the two steps are usually run together in other
simulators. See the Quickstart section for an introduction to
simulating ancestry and mutations in msprime
.
See also
We run a single replicate simulation here. Please see the Randomness and replication section for more information on how to run replicate simulations in msprime.
Discoal#
# discoal {nsam} {num_repeats} {L} -t {4*N*u*L} -r {4*N*r*(L-1)}
$ discoal 300 100 1000000 -t 40 -r 40
Msms#
# java -jar msms.jar {nsam} {num_repeats} \
# -t {4*N*u*L} -r {4*N*r*(L-1)} {L}
$ java -Xmx1G -jar msms.jar 300 100 -t 40 -r 40 1000000
MaCS#
# macs {nsam} {L} -t {4*N*u} -r {4*N*r}
$ macs 300 1000000 -t 0.00004 -r 0.00004
Example 2#
In Example 2, we add a selective sweep at the midpoint of the chromosome
(position 500,000) ended at 80 generations ago. The end frequency of the
selected allele is 0.9; the selection coefficient is 0.2 for genotype AA
and
0.1 for Aa
. This type of simulation can be done in msprime
, discoal
and msms
.
Msprime#
import msprime
N = 1000 # effective population size
model_list = [
msprime.StandardCoalescent(duration=80), # From generation 0 to 80
msprime.SweepGenicSelection( # From generation 80 to
position=500000, # selpos # selection start time (random)
start_frequency=1 / (2 * N),
end_frequency=0.9,
s=0.2, # s for AA
dt=1.0 / (40 * N)
),
msprime.StandardCoalescent() # From selection start time to coalescence
]
ts = msprime.sim_ancestry(
samples=150, # Default ploidy is 2, so 300 sampled chromosomes
population_size=N,
recombination_rate=1e-8,
sequence_length=1e6,
discrete_genome=False,
model=model_list,
)
mutated_ts = msprime.sim_mutations(
ts,
rate=1e-8,
discrete_genome=False,
model=msprime.BinaryMutationModel(),
)
The SweepGenicSelection
class provides an interface to define the
selective sweep model. More examples can be found in the
Selective sweeps section.
Note
msprime
ancestry model
parameters can be specified as a single model
(Example 1), or a list of models (Example 2). Please
see the Models section for more details.
See also
We run a single replicate simulation here. Please see the Randomness and replication section for more information on how to run replicate simulations in msprime.
Discoal#
# discoal {nsam} {num_repeats} {L} -t {4*N*u*L} -r {4*N*r*(L-1)} \
# -ws {sel_end_time/(4N)} -a {s * 0.5} \
# -x {selpos/L} -c {end_frequency} -N {N}
$ discoal 300 100 1000000 -t 40 -r 40 \
-ws 0.02 -a 200 -x 0.5 -c 0.9 -N 1000
Msms#
# java -jar msms.jar {nsam} {num_repeats} -t {4*N*u*L} -r {4*N*r*(L-1)} {L} \
# -SaA {2*N*(s*0.5)} -SAA {2*N*s} \
# -SF {sel_end_time/(4N)} {end_frequency} \
# -Sp {selpos/L} -N {N}
$ java -Xmx1G -jar msms.jar 300 100 -t 40 -r 40 1000000 \
-SaA 200 -SAA 400 -SF 0.02 0.9 -Sp 0.5 -N 1000