Legacy (version 0.x) APIs#
Msprime 1.0 involved major changes to the internal details of how simulations are performed and the addition of a new set of APIs, which solve some long-standing issues and provide more sensible defaults.
Short version:
Your old simulations will still work and the msprime 0.x APIs will be supported indefinitely. We’re not going to break your code. The only situation in which your simulation might break is if you’ve been doing some obscure things with genome discretisation. Please see this section for more information.
The new APIs are much better, and support several features that are not available for the legacy API. In general, new features will only be added to the 1.x APIs and 0.x legacy APIs are in maintenance mode.
Upgrading code#
This section is to help legacy 0.x users of msprime get up to speed quickly, summarising the new APIs and their main differences to the 0.x versions.
The main change is that there are two new functions, sim_ancestry()
and
sim_mutations()
which correspond to the 0.x functions simulate()
and mutate()
. The 0.x functions are deprecated but will continue
to be supported indefinitely.
One major change is that ancestry and mutations must be simulated separately. See the Adding mutations section for idiomatic examples of how to do this efficiently.
Ancestry#
The new sim_ancestry()
function replaces the 0.x simulate()
function and is very similar. See the Ancestry simulations page for
details and extensive examples using this function.
There are some important differences between simulate()
and sim_ancestry()
:
The
samples
parameter now refers to the number of individuals rather than the number of nodes (i.e. monoploid genomes). Because the default ploidy is 2 (see the next point) the upshot is thatsim_ancestry(2)
will result in a tree sequence with four sample nodes, not two. (It is possible to override this behaviour using the list ofSampleSet
objects parameter tosamples
.)The
Ne
parameter in 0.xsimulate()
function has been replaced with thepopulation_size
parameter.There is now a Ploidy parameter, which has two effects:
Sets the default number of sample nodes per individual
Changes the timescale over which coalescence occurs. By default
ploidy
is 2 and so mean time to common ancestor in a population of sizeN
is2N
generations, which is the same as msprime 0.x.
Rather than two parameters
num_samples
andsamples
, thesim_ancestry()
function has a single parametersamples
which has different behaviour depending on the type of parameters provided. See Specifying samples for details. Note in particular that a list ofSample
objects is not supported.Similarly, there is now one parameter
recombination_rate
which can be either a single value or aRateMap
object. Note that the 0.xRecombinationMap
is deprecated and not supported as input tosim_ancestry()
. See Recombination for more details.Simulations are performed on a discrete genome by default. To get the 0.x behaviour of a continuous genome, set
discrete_genome=False
. See the Discrete or continuous? section for more details.The
from_ts
parameter used has been renamed toinitial_state
and accepts either atskit.TableCollection
ortskit.TreeSequence
parameter. See the Specifying the initial state section for details.There is no
mutation_rate
parameter tosim_ancestry()
: usesim_mutations()
instead.The
population_configurations
,migration_matrix
anddemographic_events
parameters have been replace with a single parameterdemography
, which must take aDemography
instance. (See the next section for more details.)
Demography#
A new Demography
object has been added for version 1.0 which
encapsulates the functionality needed to define and debug demographic models
in msprime. Demographic models can only be specified to sim_ancestry
using an instance of this class.
See the Demographic models section for detailed documentation on this new interface.
It is easy to create a
Demography
from the 0.xpopulation_configurations
,migration_matrix
anddemographic_events
values using theDemography.from_old_style()
method.The
DemographyDebugger
class should no longer be instantiated directly; instead use theDemography.debug()
method.
Genome discretisation#
In msprime 0.x, recombination was implemented internally using a discrete number of genetic loci. That is, the simulation was performed in genetic coordinates, which were then mapped back to physical coordinates at the end of simulation. This had the significant advantage that recombination could be implemented during the simulation as a uniform process over these discrete loci. However, it led to many numerical issues encountered when mapping back and forth between genetic and physical coordinates as well as limiting what could be done in terms of gene conversion and other processes. We therefore changed to using physical coordinates throughout the simulation for msprime 1.0.
The simulations in 0.x and 1.x are almost entirely compatible and everything should work as expected when running 0.x code on msprime 1.0 or later. However, there is one (hopefully obscure) case in which code written for msprime 0.x will no longer work.
The num_loci
argument to the 0.x class RecombinationMap
was used to control the number of discrete genetic loci in the simulation. By
default, this was set to a large number (\(\sim 2^{32}\)), effectively
giving a continuous coordinate space when mapped back into physical units.
By setting the num_loci
equal
to the sequence length of the RecombinationMap, we could also specify
discrete physical loci. Specifying whether we simulate in discrete or continuous
genome coordinates is now done using the discrete_genome
argument
to sim_ancestry()
(see the Discrete or continuous?
section for more details). The RateMap
class is now used to
specify varying rates of recombination along the genome and no longer
has any concept of genetic “loci” — the choice of coordinate space
is now decoupled from our specification of the recombination process.
Both the cases of discrete and fully continuous genomes are well
supported in msprime 1.x and so nearly all existing code
should continue to work as expected.
What is no longer supported is specifying the “granularity” of the
continuous space via the num_loci
parameter, and if we try
to set num_loci
to anything other than the sequence length
we get an error:
import msprime
# Here we try to make a sequence length of 10 with 5 discrete loci
recomb_map = msprime.RecombinationMap(positions=[0, 10], rates=[0.1, 0], num_loci=5)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[1], line 4
1 import msprime
3 # Here we try to make a sequence length of 10 with 5 discrete loci
----> 4 recomb_map = msprime.RecombinationMap(positions=[0, 10], rates=[0.1, 0], num_loci=5)
File ~/work/tskit-site/tskit-site/msprime/intervals.py:649, in RecombinationMap.__init__(self, positions, rates, num_loci, map_start)
647 self._is_discrete = num_loci == positions[-1]
648 if num_loci is not None and num_loci != positions[-1]:
--> 649 raise ValueError(
650 "The RecombinationMap interface is deprecated and only "
651 "partially supported. If you wish to simulate a number of "
652 "discrete loci, you must set num_loci == the sequence length. "
653 "If you wish to simulate recombination process on as fine "
654 "a map as possible, please omit the num_loci parameter (or set "
655 "to None). Otherwise, num_loci is no longer supported and "
656 "the behaviour of msprime 0.x cannot be emulated. Please "
657 "consider upgrading your code to the version 1.x APIs."
658 )
659 self.map = RateMap(position=positions, rate=rates[:-1])
ValueError: The RecombinationMap interface is deprecated and only partially supported. If you wish to simulate a number of discrete loci, you must set num_loci == the sequence length. If you wish to simulate recombination process on as fine a map as possible, please omit the num_loci parameter (or set to None). Otherwise, num_loci is no longer supported and the behaviour of msprime 0.x cannot be emulated. Please consider upgrading your code to the version 1.x APIs.
If you get this error, please check whether specifying a
number of loci like this was actually what you intended. Almost
certainly you actually wanted to simulate a continuous genome
(omit the num_loci
parameter) or a discrete genome
with the breaks occurring integer boundaries (set num_loci
equal to the sequence length).
If not, please let us know your use case and we may be able to accommodate it in the new code. Until then, you will need to downgrade msprime to 0.7.x for your simulations to run.
Mutations#
Msprime 1.0 provides powerful new methods for simulating mutational
processes, adding support for finite-sites mutations and a
range of different mutation models.
Similarly to the approach for ancestry simulations, we introduce
a new function sim_mutations()
which allows us to provide
new, more appropriate defaults while still supporting older code.
Differences between the 1.x sim_mutations()
and 0.x mutate()
functions:
The
sim_mutations()
function works on a discrete genome by default.There are now also many new mutation models supported by
sim_mutations()
; see Mutation simulations for details. These are not supported in the deprecatedmutate()
function.The simulated mutations now have a simulated
time
value, which specifies the precise time that the mutation occurred. Note that this value is also provided in the returned tables for the deprecatedsimulate()
andmutate()
functions, which may lead to some compatibility issues. (We considered removing the simulated mutation times for these 0.x functions for strict compatibility, but this would have broken any code using thekeep
option in mutate.)
API Reference#
Ancestry#
- msprime.simulate()[source]#
Simulates the coalescent with recombination under the specified model parameters and returns the resulting
tskit.TreeSequence
. Note that Ne is the effective diploid population size (so the effective number of genomes in the population is 2*Ne), butsample_size
is the number of (monoploid) genomes sampled.Important
This function is deprecated (but supported indefinitely); please use
sim_ancestry()
in new code.- Parameters:
sample_size (int) – The number of sampled monoploid genomes. If not specified or None, this defaults to the sum of the subpopulation sample sizes. Either
sample_size
,population_configurations
orsamples
must be specified.Ne (float) – The effective (diploid) population size. This defaults to 1 if not specified.
length (float) – The length of the simulated region in bases. This parameter cannot be used along with
recombination_map
. Defaults to 1 if not specified.recombination_rate (float) – The rate of recombination per base per generation. This parameter cannot be used along with
recombination_map
. Defaults to 0 if not specified.recombination_map (
RecombinationMap
) – The map describing the changing rates of recombination along the simulated chromosome. This parameter cannot be used along with therecombination_rate
orlength
parameters, as these values are encoded within the map. Defaults to a uniform rate as described in therecombination_rate
parameter if not specified.mutation_rate (float) – The rate of infinite sites mutations per unit of sequence length per generation. If not specified, no mutations are generated. This option only allows for infinite sites mutations with a binary (i.e., 0/1) alphabet. For more control over the mutational process, please use the
mutate()
function.population_configurations (list or None) – The list of
PopulationConfiguration
instances describing the sampling configuration, relative sizes and growth rates of the populations to be simulated. If this is not specified, a single population with a sample of sizesample_size
is assumed.migration_matrix (list) – The matrix describing the rates of migration between all pairs of populations. If \(N\) populations are defined in the
population_configurations
parameter, then the migration matrix must be an \(N \times N\) matrix with 0 on the diagonal, consisting of \(N\) lists of length \(N\) or an \(N \times N\) numpy array. The \([j, k]^{th}\) element of the migration matrix gives the expected number of migrants moving from population \(k\) to population \(j\) per generation, divided by the size of population \(j\). When simulating from the discrete-time Wright-Fisher model (model = "dtwf"
), the row sums of the migration matrix must not exceed 1. There are no sum constraints for migration rates in continuous-time models.demographic_events (list) – The list of demographic events to simulate. Demographic events describe changes to the populations in the past. Events should be supplied in non-decreasing order of time in the past. Events with the same time value will be applied sequentially in the order that they were supplied before the simulation algorithm continues with the next time step.
samples (list) – The list specifying the location and time of all samples. This parameter may be used to specify historical samples, and cannot be used in conjunction with the
sample_size
parameter. Each sample is a (population
,time
) pair such that the sample in positionj
in the list of samples is drawn in the specified population at the specified time. Time is measured in generations ago, as elsewhere.random_seed (int) – The random seed. If this is None, a random seed will be automatically generated. Valid random seeds must be between 1 and \(2^{32} - 1\).
num_replicates (int) – The number of replicates of the specified parameters to simulate. If this is not specified or None, no replication is performed and a
tskit.TreeSequence
object returned. If num_replicates is provided, the specified number of replicates is performed, and an iterator over the resultingtskit.TreeSequence
objects returned.from_ts (tskit.TreeSequence) – If specified, initialise the simulation from the root segments of this tree sequence and return the updated tree sequence. Please see here for details on the required properties of this tree sequence and its interactions with other parameters. (Default: None).
start_time (float) – If specified, set the initial time that the simulation starts to this value. If not specified, the start time is zero if performing a simulation of a set of samples, or is the time of the oldest node if simulating from an existing tree sequence (see the
from_ts
parameter).end_time (float) – If specified, terminate the simulation at the specified time. In the returned tree sequence, all rootward paths from samples with time <= end_time will end in a node with one child with time equal to end_time. Sample nodes with time > end_time will also be present in the output tree sequence. If not specified or
None
, run the simulation until all samples have an MRCA at all positions in the genome.record_full_arg (bool) – If True, record all intermediate nodes arising from common ancestor and recombination events in the output tree sequence. This will result in unary nodes (i.e., nodes in marginal trees that have only one child). Defaults to False.
model (str or AncestryModel) – The simulation model to use. This can either be a string (e.g.,
"smc_prime"
) or an instance of a ancestry model class (e.g,msprime.DiscreteTimeWrightFisher()
.record_provenance (bool) – If True, record all input parameters in the tree sequence Provenance.
- Returns:
The
tskit.TreeSequence
object representing the results of the simulation if no replication is performed, or an iterator over the independent replicates simulated if the num_replicates parameter has been used.- Return type:
tskit.TreeSequence
or an iterator overtskit.TreeSequence
replicates.
- class msprime.PopulationConfiguration(sample_size=None, initial_size=None, growth_rate=0.0, metadata=None)[source]#
The initial configuration of a population (or deme) in a simulation.
Important
This class is deprecated (but supported indefinitely); please use the msprime 1.0 demography API in new code.
- Parameters:
sample_size (int) – The number of initial samples that are drawn from this population.
initial_size (float) – The absolute size of the population at time zero. Defaults to the reference population size \(N_e\).
growth_rate (float) – The forwards-time exponential growth rate of the population per generation. Growth rates can be negative. This is zero for a constant population size, and positive for a population that has been growing. Defaults to 0.
metadata (dict) – A JSON-encodable dictionary of metadata to associate with the corresponding Population in the output tree sequence. If not specified or None, no metadata is stored (i.e., an empty bytes array). Note that this metadata is ignored when using the
from_ts
argument tosimulate()
, as the population definitions in the tree sequence that is used as the starting point take precedence.
- class msprime.PopulationParametersChange(time, initial_size=None, growth_rate=None, population=None, population_id=None)[source]#
Changes the demographic parameters of a population at a given time.
This event generalises the
-eg
,-eG
,-en
and-eN
options fromms
. Note that unlikems
we do not automatically set growth rates to zero when the population size is changed.Important
This class is deprecated (but supported indefinitely); please use the
Demography.add_population_parameters_change()
method in new code.- Parameters:
time (float) – The length of time ago at which this event occurred.
initial_size (float) – The absolute diploid size of the population at the beginning of the time slice starting at
time
. If None, this is calculated according to the initial population size and growth rate over the preceding time slice.growth_rate (float) – The new per-generation growth rate. If None, the growth rate is not changed. Defaults to None.
population (int) – The ID of the population affected. If
population
is None, the changes affect all populations simultaneously.
- class msprime.MigrationRateChange(time, rate, source=-1, dest=-1, matrix_index=None)[source]#
Changes the rate of migration from one deme to another to a new value at a specific time. Migration rates are specified in terms of the rate at which lineages move from population
source
todest
during the progress of the simulation. Note thatsource
anddest
are from the perspective of the coalescent process; please see the Models section for more details on the interpretation of this migration model.By default,
source=-1
anddest=-1
, which results in all non-diagonal elements of the migration matrix being changed to the new rate. Ifsource
anddest
are specified, they must refer to valid population IDs.Important
This class is deprecated (but supported indefinitely); please use the
Demography.add_migration_rate_change()
method in new code.
- class msprime.MassMigration(time, source, dest=None, proportion=1.0, destination=None)[source]#
A mass migration event in which some fraction of the population in one deme (the
source
) simultaneously move to another deme (dest
) during the progress of the simulation. Each lineage currently present in the source population moves to the destination population with probability equal toproportion
. Note thatsource
anddest
are from the perspective of the coalescent process; please see the Models section for more details on the interpretation of this migration model.This event class generalises the population split (
-ej
) and admixture (-es
) events fromms
. Note that MassMigrations do not have any side effects on the migration matrix.Important
This class is deprecated (but supported indefinitely); please use the
Demography.add_mass_migration()
method in new code. In addition, please see the new higher-level Population split and Admixture events.- Parameters:
- class msprime.CensusEvent(time)[source]#
An event that adds a node to each branch of every tree at a given time during the simulation. This may be used to record all ancestral haplotypes present at that time, and to extract other information related to these haplotypes: for instance to trace the local ancestry of a sample back to a set of contemporaneous ancestors, or to assess whether a subset of samples has coalesced more recently than the census time.
See Census events for more details.
Important
This class is deprecated (but supported indefinitely); please use the
Demography.add_census()
method in new code.- Parameters:
time (float) – The time at which this event occurs in generations.
- class msprime.Sample(population, time)#
- class msprime.SimulationModelChange(time=None, model=None)[source]#
Demographic event denoting an change in ancestry model.
Important
This class is deprecated (but supported indefinitely); please use the
model
argument insim_ancestry()
to specify multiple models in new code.
Recombination maps#
- class msprime.RecombinationMap(positions, rates, num_loci=None, map_start=0)[source]#
A RecombinationMap represents the changing rates of recombination along a chromosome. This is defined via two lists of numbers:
positions
andrates
, which must be of the same length. Given an index j in these lists, the rate of recombination per base per generation isrates[j]
over the intervalpositions[j]
topositions[j + 1]
. Consequently, the first position must be zero, and by convention the last rate value is also required to be zero (although it is not used).Important
This class is deprecated (but supported indefinitely); please use the
RateMap
class in new code. In particular, note that when specifyingrates
in the theRateMap
class we now require an array of length \(n - 1\) (this class requires an array of length \(n\) in which the last entry is zero).- Parameters:
positions (list) – The positions (in bases) denoting the distinct intervals where recombination rates change. These can be floating point values.
rates (list) – The list of rates corresponding to the supplied
positions
. Recombination rates are specified per base, per generation.num_loci (int) – This parameter is no longer supported. Must be either None (meaning a continuous genome of the finest possible resolution) or be equal to
positions[-1]
(meaning a discrete genome). Any other value will result in an error. Please see the Genome discretisation section for more information.
- classmethod uniform_map(length, rate, num_loci=None)[source]#
Returns a
RecombinationMap
instance in which the recombination rate is constant over a chromosome of the specified length. The legacynum_loci
option is no longer supported and should not be used.
- classmethod read_hapmap(filename)[source]#
Parses the specified file in HapMap format.
Warning
This method is deprecated, use the
RateMap.read_hapmap()
method instead.- Parameters:
filename (str) – The name of the file to be parsed. This may be in plain text or gzipped plain text.
- Returns:
A RecombinationMap object.
- property mean_recombination_rate#
Return the weighted mean recombination rate across all windows of the entire recombination map.
Mutations#
- msprime.mutate(tree_sequence, rate=None, random_seed=None, model=None, keep=False, start_time=None, end_time=None)[source]#
Simulates mutations on the specified ancestry and returns the resulting
tskit.TreeSequence
. Mutations are generated at the specified rate in measured generations. Mutations are generated under the infinite sites model, and so the rate of new mutations is per unit of sequence length per generation.Note
This function is deprecated and we recommend new code use the
sim_mutations()
function instead, which is similar but offers more functionality and defaults to simulating mutations on a discrete genome. This function will be supported indefinitely, however.If a random seed is specified, this is used to seed the random number generator. If the same seed is specified and all other parameters are equal then the same mutations will be generated. If no random seed is specified then one is generated automatically.
If the
model
parameter is specified, this determines the model under which mutations are generated. By default mutations from the infinite sites model with a binary alphabet are generated.By default, sites and mutations in the parameter tree sequence are discarded. If the
keep
parameter is true, however, additional mutations are simulated. Under the infinite sites mutation model, all new mutations generated will occur at distinct positions from each other and from any existing mutations (by rejection sampling).The time interval over which mutations can occur may be controlled using the
start_time
andend_time
parameters. Thestart_time
defines the lower bound (in time-ago) on this interval andmax_time
the upper bound. Note that we may have mutations associated with nodes with time <=start_time
since mutations store the node at the bottom (i.e., towards the leaves) of the branch that they occur on.- Parameters:
tree_sequence (tskit.TreeSequence) – The tree sequence onto which we wish to throw mutations.
rate (float) – The rate of mutation per generation. (Default: 0).
random_seed (int) – The random seed. If this is None, a random seed will be automatically generated. Valid random seeds must be between 1 and \(2^{32} - 1\).
model (MutationModel) – The mutation model to use when generating mutations. If not specified or None, the
InfiniteSites
mutation model is used.keep (bool) – Whether to keep existing mutations (default: False).
start_time (float) – The minimum time at which a mutation can occur. (Default: no restriction.)
end_time (float) – The maximum time at which a mutation can occur (Default: no restriction).
- Returns:
The
tskit.TreeSequence
object resulting from overlaying mutations on the input tree sequence.- Return type: