Legacy (version 0.x) APIs#

Msprime 1.0 involved major changes to the internal details of how simulations are performed and the addition of a new set of APIs, which solve some long-standing issues and provide more sensible defaults.

Short version:

  • Your old simulations will still work and the msprime 0.x APIs will be supported indefinitely. We’re not going to break your code. The only situation in which your simulation might break is if you’ve been doing some obscure things with genome discretisation. Please see this section for more information.

  • The new APIs are much better, and support several features that are not available for the legacy API. In general, new features will only be added to the 1.x APIs and 0.x legacy APIs are in maintenance mode.

Upgrading code#

This section is to help legacy 0.x users of msprime get up to speed quickly, summarising the new APIs and their main differences to the 0.x versions.

The main change is that there are two new functions, sim_ancestry() and sim_mutations() which correspond to the 0.x functions simulate() and mutate(). The 0.x functions are deprecated but will continue to be supported indefinitely.

One major change is that ancestry and mutations must be simulated separately. See the Adding mutations section for idiomatic examples of how to do this efficiently.

Ancestry#

The new sim_ancestry() function replaces the 0.x simulate() function and is very similar. See the Ancestry simulations page for details and extensive examples using this function.

There are some important differences between simulate() and sim_ancestry():

  • The samples parameter now refers to the number of individuals rather than the number of nodes (i.e. monoploid genomes). Because the default ploidy is 2 (see the next point) the upshot is that sim_ancestry(2) will result in a tree sequence with four sample nodes, not two. (It is possible to override this behaviour using the list of SampleSet objects parameter to samples.)

  • The Ne parameter in 0.x simulate() function has been replaced with the population_size parameter.

  • There is now a Ploidy parameter, which has two effects:

    1. Sets the default number of sample nodes per individual

    2. Changes the timescale over which coalescence occurs. By default ploidy is 2 and so mean time to common ancestor in a population of size N is 2N generations, which is the same as msprime 0.x.

  • Rather than two parameters num_samples and samples, the sim_ancestry() function has a single parameter samples which has different behaviour depending on the type of parameters provided. See Specifying samples for details. Note in particular that a list of Sample objects is not supported.

  • Similarly, there is now one parameter recombination_rate which can be either a single value or a RateMap object. Note that the 0.x RecombinationMap is deprecated and not supported as input to sim_ancestry(). See Recombination for more details.

  • Simulations are performed on a discrete genome by default. To get the 0.x behaviour of a continuous genome, set discrete_genome=False. See the Discrete or continuous? section for more details.

  • The from_ts parameter used has been renamed to initial_state and accepts either a tskit.TableCollection or tskit.TreeSequence parameter. See the Specifying the initial state section for details.

  • There is no mutation_rate parameter to sim_ancestry(): use sim_mutations() instead.

  • The population_configurations, migration_matrix and demographic_events parameters have been replace with a single parameter demography, which must take a Demography instance. (See the next section for more details.)

Demography#

A new Demography object has been added for version 1.0 which encapsulates the functionality needed to define and debug demographic models in msprime. Demographic models can only be specified to sim_ancestry using an instance of this class.

See the Demographic models section for detailed documentation on this new interface.

Genome discretisation#

In msprime 0.x, recombination was implemented internally using a discrete number of genetic loci. That is, the simulation was performed in genetic coordinates, which were then mapped back to physical coordinates at the end of simulation. This had the significant advantage that recombination could be implemented during the simulation as a uniform process over these discrete loci. However, it led to many numerical issues encountered when mapping back and forth between genetic and physical coordinates as well as limiting what could be done in terms of gene conversion and other processes. We therefore changed to using physical coordinates throughout the simulation for msprime 1.0.

The simulations in 0.x and 1.x are almost entirely compatible and everything should work as expected when running 0.x code on msprime 1.0 or later. However, there is one (hopefully obscure) case in which code written for msprime 0.x will no longer work.

The num_loci argument to the 0.x class RecombinationMap was used to control the number of discrete genetic loci in the simulation. By default, this was set to a large number (\(\sim 2^{32}\)), effectively giving a continuous coordinate space when mapped back into physical units. By setting the num_loci equal to the sequence length of the RecombinationMap, we could also specify discrete physical loci. Specifying whether we simulate in discrete or continuous genome coordinates is now done using the discrete_genome argument to sim_ancestry() (see the Discrete or continuous? section for more details). The RateMap class is now used to specify varying rates of recombination along the genome and no longer has any concept of genetic “loci” — the choice of coordinate space is now decoupled from our specification of the recombination process.

Both the cases of discrete and fully continuous genomes are well supported in msprime 1.x and so nearly all existing code should continue to work as expected. What is no longer supported is specifying the “granularity” of the continuous space via the num_loci parameter, and if we try to set num_loci to anything other than the sequence length we get an error:

import msprime

# Here we try to make a sequence length of 10 with 5 discrete loci
recomb_map = msprime.RecombinationMap(positions=[0, 10], rates=[0.1, 0], num_loci=5)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[1], line 4
      1 import msprime
      3 # Here we try to make a sequence length of 10 with 5 discrete loci
----> 4 recomb_map = msprime.RecombinationMap(positions=[0, 10], rates=[0.1, 0], num_loci=5)

File ~/work/tskit-site/tskit-site/msprime/intervals.py:649, in RecombinationMap.__init__(self, positions, rates, num_loci, map_start)
    647 self._is_discrete = num_loci == positions[-1]
    648 if num_loci is not None and num_loci != positions[-1]:
--> 649     raise ValueError(
    650         "The RecombinationMap interface is deprecated and only "
    651         "partially supported. If you wish to simulate a number of "
    652         "discrete loci, you must set num_loci == the sequence length. "
    653         "If you wish to simulate recombination process on as fine "
    654         "a map as possible, please omit the num_loci parameter (or set "
    655         "to None). Otherwise, num_loci is no longer supported and "
    656         "the behaviour of msprime 0.x cannot be emulated. Please "
    657         "consider upgrading your code to the version 1.x APIs."
    658     )
    659 self.map = RateMap(position=positions, rate=rates[:-1])

ValueError: The RecombinationMap interface is deprecated and only partially supported. If you wish to simulate a number of discrete loci, you must set num_loci == the sequence length. If you wish to simulate recombination process on as fine a map as possible, please omit the num_loci parameter (or set to None). Otherwise, num_loci is no longer supported and the behaviour of msprime 0.x cannot be emulated. Please consider upgrading your code to the version 1.x APIs.

If you get this error, please check whether specifying a number of loci like this was actually what you intended. Almost certainly you actually wanted to simulate a continuous genome (omit the num_loci parameter) or a discrete genome with the breaks occurring integer boundaries (set num_loci equal to the sequence length).

If not, please let us know your use case and we may be able to accommodate it in the new code. Until then, you will need to downgrade msprime to 0.7.x for your simulations to run.

Mutations#

Msprime 1.0 provides powerful new methods for simulating mutational processes, adding support for finite-sites mutations and a range of different mutation models. Similarly to the approach for ancestry simulations, we introduce a new function sim_mutations() which allows us to provide new, more appropriate defaults while still supporting older code.

Differences between the 1.x sim_mutations() and 0.x mutate() functions:

  • The sim_mutations() function works on a discrete genome by default.

  • There are now also many new mutation models supported by sim_mutations(); see Mutation simulations for details. These are not supported in the deprecated mutate() function.

  • The simulated mutations now have a simulated time value, which specifies the precise time that the mutation occurred. Note that this value is also provided in the returned tables for the deprecated simulate() and mutate() functions, which may lead to some compatibility issues. (We considered removing the simulated mutation times for these 0.x functions for strict compatibility, but this would have broken any code using the keep option in mutate.)

API Reference#

Ancestry#

msprime.simulate()[source]#

Simulates the coalescent with recombination under the specified model parameters and returns the resulting tskit.TreeSequence. Note that Ne is the effective diploid population size (so the effective number of genomes in the population is 2*Ne), but sample_size is the number of (monoploid) genomes sampled.

Important

This function is deprecated (but supported indefinitely); please use sim_ancestry() in new code.

Parameters:
  • sample_size (int) – The number of sampled monoploid genomes. If not specified or None, this defaults to the sum of the subpopulation sample sizes. Either sample_size, population_configurations or samples must be specified.

  • Ne (float) – The effective (diploid) population size. This defaults to 1 if not specified.

  • length (float) – The length of the simulated region in bases. This parameter cannot be used along with recombination_map. Defaults to 1 if not specified.

  • recombination_rate (float) – The rate of recombination per base per generation. This parameter cannot be used along with recombination_map. Defaults to 0 if not specified.

  • recombination_map (RecombinationMap) – The map describing the changing rates of recombination along the simulated chromosome. This parameter cannot be used along with the recombination_rate or length parameters, as these values are encoded within the map. Defaults to a uniform rate as described in the recombination_rate parameter if not specified.

  • mutation_rate (float) – The rate of infinite sites mutations per unit of sequence length per generation. If not specified, no mutations are generated. This option only allows for infinite sites mutations with a binary (i.e., 0/1) alphabet. For more control over the mutational process, please use the mutate() function.

  • population_configurations (list or None) – The list of PopulationConfiguration instances describing the sampling configuration, relative sizes and growth rates of the populations to be simulated. If this is not specified, a single population with a sample of size sample_size is assumed.

  • migration_matrix (list) – The matrix describing the rates of migration between all pairs of populations. If \(N\) populations are defined in the population_configurations parameter, then the migration matrix must be an \(N \times N\) matrix with 0 on the diagonal, consisting of \(N\) lists of length \(N\) or an \(N \times N\) numpy array. The \([j, k]^{th}\) element of the migration matrix gives the expected number of migrants moving from population \(k\) to population \(j\) per generation, divided by the size of population \(j\). When simulating from the discrete-time Wright-Fisher model (model = "dtwf"), the row sums of the migration matrix must not exceed 1. There are no sum constraints for migration rates in continuous-time models.

  • demographic_events (list) – The list of demographic events to simulate. Demographic events describe changes to the populations in the past. Events should be supplied in non-decreasing order of time in the past. Events with the same time value will be applied sequentially in the order that they were supplied before the simulation algorithm continues with the next time step.

  • samples (list) – The list specifying the location and time of all samples. This parameter may be used to specify historical samples, and cannot be used in conjunction with the sample_size parameter. Each sample is a (population, time) pair such that the sample in position j in the list of samples is drawn in the specified population at the specified time. Time is measured in generations ago, as elsewhere.

  • random_seed (int) – The random seed. If this is None, a random seed will be automatically generated. Valid random seeds must be between 1 and \(2^{32} - 1\).

  • num_replicates (int) – The number of replicates of the specified parameters to simulate. If this is not specified or None, no replication is performed and a tskit.TreeSequence object returned. If num_replicates is provided, the specified number of replicates is performed, and an iterator over the resulting tskit.TreeSequence objects returned.

  • from_ts (tskit.TreeSequence) – If specified, initialise the simulation from the root segments of this tree sequence and return the updated tree sequence. Please see here for details on the required properties of this tree sequence and its interactions with other parameters. (Default: None).

  • start_time (float) – If specified, set the initial time that the simulation starts to this value. If not specified, the start time is zero if performing a simulation of a set of samples, or is the time of the oldest node if simulating from an existing tree sequence (see the from_ts parameter).

  • end_time (float) – If specified, terminate the simulation at the specified time. In the returned tree sequence, all rootward paths from samples with time <= end_time will end in a node with one child with time equal to end_time. Sample nodes with time > end_time will also be present in the output tree sequence. If not specified or None, run the simulation until all samples have an MRCA at all positions in the genome.

  • record_full_arg (bool) – If True, record all intermediate nodes arising from common ancestor and recombination events in the output tree sequence. This will result in unary nodes (i.e., nodes in marginal trees that have only one child). Defaults to False.

  • model (str or AncestryModel) – The simulation model to use. This can either be a string (e.g., "smc_prime") or an instance of a ancestry model class (e.g, msprime.DiscreteTimeWrightFisher().

  • record_provenance (bool) – If True, record all input parameters in the tree sequence Provenance.

Returns:

The tskit.TreeSequence object representing the results of the simulation if no replication is performed, or an iterator over the independent replicates simulated if the num_replicates parameter has been used.

Return type:

tskit.TreeSequence or an iterator over tskit.TreeSequence replicates.

class msprime.PopulationConfiguration(sample_size=None, initial_size=None, growth_rate=0.0, metadata=None)[source]#

The initial configuration of a population (or deme) in a simulation.

Important

This class is deprecated (but supported indefinitely); please use the msprime 1.0 demography API in new code.

Parameters:
  • sample_size (int) – The number of initial samples that are drawn from this population.

  • initial_size (float) – The absolute size of the population at time zero. Defaults to the reference population size \(N_e\).

  • growth_rate (float) – The forwards-time exponential growth rate of the population per generation. Growth rates can be negative. This is zero for a constant population size, and positive for a population that has been growing. Defaults to 0.

  • metadata (dict) – A JSON-encodable dictionary of metadata to associate with the corresponding Population in the output tree sequence. If not specified or None, no metadata is stored (i.e., an empty bytes array). Note that this metadata is ignored when using the from_ts argument to simulate(), as the population definitions in the tree sequence that is used as the starting point take precedence.

class msprime.PopulationParametersChange(time, initial_size=None, growth_rate=None, population=None, population_id=None)[source]#

Changes the demographic parameters of a population at a given time.

This event generalises the -eg, -eG, -en and -eN options from ms. Note that unlike ms we do not automatically set growth rates to zero when the population size is changed.

Important

This class is deprecated (but supported indefinitely); please use the Demography.add_population_parameters_change() method in new code.

Parameters:
  • time (float) – The length of time ago at which this event occurred.

  • initial_size (float) – The absolute diploid size of the population at the beginning of the time slice starting at time. If None, this is calculated according to the initial population size and growth rate over the preceding time slice.

  • growth_rate (float) – The new per-generation growth rate. If None, the growth rate is not changed. Defaults to None.

  • population (int) – The ID of the population affected. If population is None, the changes affect all populations simultaneously.

class msprime.MigrationRateChange(time, rate, source=-1, dest=-1, matrix_index=None)[source]#

Changes the rate of migration from one deme to another to a new value at a specific time. Migration rates are specified in terms of the rate at which lineages move from population source to dest during the progress of the simulation. Note that source and dest are from the perspective of the coalescent process; please see the Models section for more details on the interpretation of this migration model.

By default, source=-1 and dest=-1, which results in all non-diagonal elements of the migration matrix being changed to the new rate. If source and dest are specified, they must refer to valid population IDs.

Important

This class is deprecated (but supported indefinitely); please use the Demography.add_migration_rate_change() method in new code.

Parameters:
  • time (float) – The time at which this event occurs in generations.

  • rate (float) – The new per-generation migration rate.

  • source (int) – The ID of the source population.

  • dest (int) – The ID of the destination population.

class msprime.MassMigration(time, source, dest=None, proportion=1.0, destination=None)[source]#

A mass migration event in which some fraction of the population in one deme (the source) simultaneously move to another deme (dest) during the progress of the simulation. Each lineage currently present in the source population moves to the destination population with probability equal to proportion. Note that source and dest are from the perspective of the coalescent process; please see the Models section for more details on the interpretation of this migration model.

This event class generalises the population split (-ej) and admixture (-es) events from ms. Note that MassMigrations do not have any side effects on the migration matrix.

Important

This class is deprecated (but supported indefinitely); please use the Demography.add_mass_migration() method in new code. In addition, please see the new higher-level Population split and Admixture events.

Parameters:
  • time (float) – The time at which this event occurs in generations.

  • source (int) – The ID of the source population.

  • dest (int) – The ID of the destination population.

  • proportion (float) – The probability that any given lineage within the source population migrates to the destination population.

class msprime.CensusEvent(time)[source]#

An event that adds a node to each branch of every tree at a given time during the simulation. This may be used to record all ancestral haplotypes present at that time, and to extract other information related to these haplotypes: for instance to trace the local ancestry of a sample back to a set of contemporaneous ancestors, or to assess whether a subset of samples has coalesced more recently than the census time.

See Census events for more details.

Important

This class is deprecated (but supported indefinitely); please use the Demography.add_census() method in new code.

Parameters:

time (float) – The time at which this event occurs in generations.

class msprime.Sample(population, time)#
class msprime.SimulationModelChange(time=None, model=None)[source]#

Demographic event denoting an change in ancestry model.

Important

This class is deprecated (but supported indefinitely); please use the model argument in sim_ancestry() to specify multiple models in new code.

Recombination maps#

class msprime.RecombinationMap(positions, rates, num_loci=None, map_start=0)[source]#

A RecombinationMap represents the changing rates of recombination along a chromosome. This is defined via two lists of numbers: positions and rates, which must be of the same length. Given an index j in these lists, the rate of recombination per base per generation is rates[j] over the interval positions[j] to positions[j + 1]. Consequently, the first position must be zero, and by convention the last rate value is also required to be zero (although it is not used).

Important

This class is deprecated (but supported indefinitely); please use the RateMap class in new code. In particular, note that when specifying rates in the the RateMap class we now require an array of length \(n - 1\) (this class requires an array of length \(n\) in which the last entry is zero).

Parameters:
  • positions (list) – The positions (in bases) denoting the distinct intervals where recombination rates change. These can be floating point values.

  • rates (list) – The list of rates corresponding to the supplied positions. Recombination rates are specified per base, per generation.

  • num_loci (int) – This parameter is no longer supported. Must be either None (meaning a continuous genome of the finest possible resolution) or be equal to positions[-1] (meaning a discrete genome). Any other value will result in an error. Please see the Genome discretisation section for more information.

classmethod uniform_map(length, rate, num_loci=None)[source]#

Returns a RecombinationMap instance in which the recombination rate is constant over a chromosome of the specified length. The legacy num_loci option is no longer supported and should not be used.

Parameters:
  • length (float) – The length of the chromosome.

  • rate (float) – The rate of recombination per unit of sequence length along this chromosome.

  • num_loci (int) – This parameter is no longer supported.

classmethod read_hapmap(filename)[source]#

Parses the specified file in HapMap format.

Warning

This method is deprecated, use the RateMap.read_hapmap() method instead.

Parameters:

filename (str) – The name of the file to be parsed. This may be in plain text or gzipped plain text.

Returns:

A RecombinationMap object.

property mean_recombination_rate#

Return the weighted mean recombination rate across all windows of the entire recombination map.

get_total_recombination_rate()[source]#

Returns the effective recombination rate for this genetic map. This is the weighted mean of the rates across all intervals.

Mutations#

msprime.mutate(tree_sequence, rate=None, random_seed=None, model=None, keep=False, start_time=None, end_time=None)[source]#

Simulates mutations on the specified ancestry and returns the resulting tskit.TreeSequence. Mutations are generated at the specified rate in measured generations. Mutations are generated under the infinite sites model, and so the rate of new mutations is per unit of sequence length per generation.

Note

This function is deprecated and we recommend new code use the sim_mutations() function instead, which is similar but offers more functionality and defaults to simulating mutations on a discrete genome. This function will be supported indefinitely, however.

If a random seed is specified, this is used to seed the random number generator. If the same seed is specified and all other parameters are equal then the same mutations will be generated. If no random seed is specified then one is generated automatically.

If the model parameter is specified, this determines the model under which mutations are generated. By default mutations from the infinite sites model with a binary alphabet are generated.

By default, sites and mutations in the parameter tree sequence are discarded. If the keep parameter is true, however, additional mutations are simulated. Under the infinite sites mutation model, all new mutations generated will occur at distinct positions from each other and from any existing mutations (by rejection sampling).

The time interval over which mutations can occur may be controlled using the start_time and end_time parameters. The start_time defines the lower bound (in time-ago) on this interval and max_time the upper bound. Note that we may have mutations associated with nodes with time <= start_time since mutations store the node at the bottom (i.e., towards the leaves) of the branch that they occur on.

Parameters:
  • tree_sequence (tskit.TreeSequence) – The tree sequence onto which we wish to throw mutations.

  • rate (float) – The rate of mutation per generation. (Default: 0).

  • random_seed (int) – The random seed. If this is None, a random seed will be automatically generated. Valid random seeds must be between 1 and \(2^{32} - 1\).

  • model (MutationModel) – The mutation model to use when generating mutations. If not specified or None, the InfiniteSites mutation model is used.

  • keep (bool) – Whether to keep existing mutations (default: False).

  • start_time (float) – The minimum time at which a mutation can occur. (Default: no restriction.)

  • end_time (float) – The maximum time at which a mutation can occur (Default: no restriction).

Returns:

The tskit.TreeSequence object resulting from overlaying mutations on the input tree sequence.

Return type:

tskit.TreeSequence

class msprime.InfiniteSites(alphabet=0)[source]#