API Reference¶
This lists the detailed reference documentation for the msprime Python API. The reference documentation aims to be concise, precise and exhaustive; as such, it is not the best place to start if you are new to a particular piece of functionality. Please see the toplevel documentation for Ancestry simulations, Mutation simulations or Demographic models for discussion and examples of individual features.
Summary¶
Ancestry¶

Simulates an ancestral process described by the specified model, demography and samples, and return a 

Specify a set of exchangeable sample individuals with a given ploidy value from a population at a given time. 

The classical coalescent with recombination model (i.e., Hudson’s algorithm). 

The Sequentially Markov Coalescent (SMC) model defined by McVean and Cardin (2005). 

The SMC’ model defined by Marjoram and Wall (2006) as a refinement of the 

A discrete backwardstime WrightFisher model, with diploid backandforth recombination. 

A Lambdacoalescent with multiple mergers in the haploid cases, or a Xicoalescent with simultaneous multiple mergers in the polyploid case. 

A Lambdacoalescent with multiple mergers in the haploid cases, or a Xicoalescent with simultaneous multiple mergers in the polyploid case. 

A selective sweep that has occured in the history of the sample. 
Mutations¶

Simulates mutations on the specified ancestry and returns the resulting 

JukesCantor mutation model (Jukes and Cantor 1969). 

The Hasegawa, Kishino and Yano mutation model (Hasegawa et al. 

The F84 mutation model (Felsenstein and Churchill, 1996). 

The Generalised TimeReversible nucleotide mutation model, a general parameterization of a timereversible mutation process (Tavaré et al. 

The BLOSUM62 model of timereversible amino acid mutation. 

The PAM model of timereversible amino acid mutation. 

Basic binary mutation model with two alleles: 
Superclass of the specific mutation models with a finite set of states. 

An infinite alleles model of mutation in which each allele is a unique positive integer. 

An infinitealleles model of mutation producing “SLiMstyle” mutations. 
Demography¶

A single population in a 

The definition of a demographic model for an msprime simulation, consisting of a set of populations, a migration matrix, and a list of demographic events. 

Utilities to compute and display information about the state of populations during the different simulation epochs defined by demographic events. 
Reference documentation¶
Ancestry¶

msprime.
sim_ancestry
(samples=None, *, demography=None, sequence_length=None, discrete_genome=None, recombination_rate=None, gene_conversion_rate=None, gene_conversion_tract_length=None, population_size=None, ploidy=None, model=None, initial_state=None, start_time=None, end_time=None, record_migrations=None, record_full_arg=None, num_labels=None, random_seed=None, num_replicates=None, replicate_index=None, record_provenance=None)[source]¶ Simulates an ancestral process described by the specified model, demography and samples, and return a
tskit.TreeSequence
(or a sequence of replicate tree sequences). Parameters
samples – The sampled individuals as either an integer, specifying the number of individuals to sample in a singlepopulation model; or a list of
SampleSet
objects defining the properties of groups of similar samples; or as a mapping in which the keys are population identifiers (either an integer ID or string name) and the values are the number of samples to take from the corresponding population at its default sampling time. It is important to note that samples correspond to individuals here, and each sampled individual is usually associated with \(k\) sample nodes (or genomes) whenploidy
= \(k\). See the Specifying samples section for further details. Eithersamples
orinitial_state
must be specified.demography – The demographic model to simulate, describing the extant and ancestral populations, their population sizes and growth rates, their migration rates, and demographic events affecting the populations over time. See the Demographic models section for details on how to specify demographic models and Specifying samples for details on how to specify the populations that samples are drawn from. If not specified (or None) we default to a single population with constant size 1 (see also the
population_size
parameter).ploidy (int) – The number of monoploid genomes per sample individual (Default=2). See the Ploidy section for usage examples.
sequence_length (float) – The length of the genome sequence to simulate. See the Sequence length section for usage examples for this parameter and how it interacts with other parameters.
discrete_genome (bool) – If True (the default) simulation occurs in discrete genome coordinates such that recombination and gene conversion breakpoints always occur at integer positions. Thus, multiple (e.g.) recombinations can occur at the same genome position. If
discrete_genome
is False simulations are performed using continuous genome coordinates. In this case multiple events at precisely the same genome location are very unlikely (but technically possible). See the Discrete or continuous? section for usage examples.recombination_rate – The rate of recombination along the sequence; can be either a single value (specifying a single rate over the entire sequence) or an instance of
RateMap
. See the Recombination section for usage examples for this parameter and how it interacts with other parameters.gene_conversion_rate – The rate of gene conversion along the sequence. If provided, a value for
gene_conversion_tract_length
must also be specified. See the Gene conversion section for usage examples for this parameter and how it interacts with other parameters.gene_conversion_tract_length – The mean length of the gene conversion tracts. For discrete genomes the tract lengths are geometrically distributed with mean
gene_conversion_tract_length
, which must be greater than or equal to 1. For continuous genomes the tract lengths are exponentially distributed with meangene_conversion_tract_length
, which must be larger than 0.population_size – The size of the default single population
Demography
. If not specified, defaults to 1. Cannot be specified along with thedemography
parameter. See the Demographic models section for more details on demographic models and population sizes and the Demography section for usage examples.random_seed (int) – The random seed. If this is not specified or None, a highquality random seed will be automatically generated. Valid random seeds must be between 1 and \(2^{32}  1\). See the Random seeds section for usage examples.
num_replicates (int) – The number of replicates of the specified parameters to simulate. If this is not specified or None, no replication is performed and a
tskit.TreeSequence
object returned. If num_replicates is provided, the specified number of replicates is performed, and an iterator over the resultingtskit.TreeSequence
objects returned. See the Running replicate simulations section for examples.record_full_arg (bool) – If True, record all intermediate nodes arising from common ancestor and recombination events in the output tree sequence. This will result in unary nodes (i.e., nodes in marginal trees that have only one child). Defaults to False. See the Ancestral recombination graph section for examples.
record_migrations (bool) – If True, record all migration events that occur in the Migration Table of the output tree sequence. Defaults to False. See the Migration events section for examples.
initial_state (tskit.TreeSequence) – If specified, initialise the simulation from the root segments of this tree sequence and return the completed tree sequence. Please see Specifying the initial state for details of the required properties of this tree sequence and its interactions with other parameters. All information in the
initial_state
tables is preserved (including metadata) and included in the returned tree sequence. (Default: None).start_time (float) – If specified, set the initial time that the simulation starts to this value. If not specified, the start time is zero if performing a simulation of a set of samples, or is the time of the oldest node if simulating from an existing tree sequence (see the
initial_state
parameter). See the Setting the start time section for examples.end_time (float) – If specified, terminate the simulation at the specified time. In the returned tree sequence, all rootward paths from samples with time <
end_time
will end in a node with one child with time equal to end_time. Any sample nodes with time >=end_time
will also be present in the output tree sequence. If not specified orNone
, run the simulation until all samples have an MRCA at all positions in the genome. See the Stopping simulations early section for examples.record_provenance (bool) – If True (the default), record all input parameters in the tree sequence Provenance.
model (str or AncestryModel or list) – The ancestry model to use. This can be either a single instance of
AncestryModel
(or a string that can be interpreted as an ancestry model), or a list ofAncestryModel
instances. If theduration
attribute of any of these models is set, the simulation will be run until at most \(t + t_m\), where \(t\) is the simulation time when the model starts and \(t_m\) is the model’sduration
. If theduration
is not set, the simulation will continue until the model completes, the overallend_time
is reached, or overall coalescence. See the Specifying ancestry models section for more details, and the Models section for the available models and examples.
 Returns
The
tskit.TreeSequence
object representing the results of the simulation if no replication is performed, or an iterator over the independent replicates simulated if the num_replicates parameter has been used. Return type
tskit.TreeSequence
or an iterator overtskit.TreeSequence
replicates.

class
msprime.
SampleSet
(num_samples, population=None, time=None, ploidy=None)[source]¶ Specify a set of exchangeable sample individuals with a given ploidy value from a population at a given time. See the Specifying samples section for details and examples.

population
: Optional[Union[int, str]] = None¶ The population in which the samples are drawn. May be either a string name or integer ID (see Identifying populations details).

time
: Optional[float] = None¶ The time at which these samples are drawn. If not specified or None, defaults to the
Population.default_sampling_time
.

Models¶

class
msprime.
AncestryModel
(*, duration=None)[source]¶ Abstract superclass of all ancestry models.

duration
: Optional[float]¶ The time duration that this model should run for. If None, the model will run until completion (i.e., until the simulation coalesces or the model itself completes). Otherwise, this defines the maximum time duration which the model can run. See the Specifying ancestry models section for more details.


class
msprime.
StandardCoalescent
(*, duration=None)[source]¶ The classical coalescent with recombination model (i.e., Hudson’s algorithm). The string
"hudson"
can be used to refer to this model.This is a continuous time model in which the time to the next event is exponentially distributed with rates depending on the population size(s), migration rates, numbers of extant lineages and the amount of ancestral material currently present. See Kelleher et al. (2016) for a detailed description of the model and further references.

class
msprime.
SmcApproxCoalescent
(*, duration=None)[source]¶ The Sequentially Markov Coalescent (SMC) model defined by McVean and Cardin (2005). In the SMC, only common ancestor events that result in marginal coalescences are possible. Under this approximation, the marginal trees along the genome depend only on the immediately previous tree (i.e. are Markovian).
Note
This model is implemented using a naive rejection sampling approach and so it may not be any more efficient to simulate than the standard Hudson model.
The string
"smc"
can be used to refer to this model.

class
msprime.
SmcPrimeApproxCoalescent
(*, duration=None)[source]¶ The SMC’ model defined by Marjoram and Wall (2006) as a refinement of the
SMC
. The SMC’ extends the SMC by additionally allowing common ancestor events that join contiguous tracts of ancestral material (as well as events that result in marginal coalescences).Note
This model is implemented using a naive rejection sampling approach and so it may not be any more efficient to simulate than the standard Hudson model.
The string
"smc_prime"
can be used to refer to this model.

class
msprime.
DiscreteTimeWrightFisher
(*, duration=None)[source]¶ A discrete backwardstime WrightFisher model, with diploid backandforth recombination. The string
"dtwf"
can be used to refer to this model.WrightFisher simulations are performed very similarly to coalescent simulations, with all parameters denoting the same quantities in both models. Because events occur at discrete times however, the order in which they occur matters. Each generation consists of the following ordered events:
Migration events. As in the Hudson coalescent, these move single extant lineages between populations. Because migration events occur before lineages choose parents, migrant lineages choose parents from their new population in the same generation.
Demographic events. All events with previous_generation < event_time <= current_generation are carried out here.
Lineages draw parents. Each (monoploid) extant lineage draws a parent from their current population.
Diploid recombination. Each parent is diploid, so all child lineages recombine backandforth into the same two parental genome copies. These become two independent lineages in the next generation.
Historical sampling events. All historical samples with previous_generation < sample_time <= current_generation are inserted.

class
msprime.
BetaCoalescent
(*, duration=None, alpha=None, truncation_point=1.7976931348623157e+308)[source]¶ A Lambdacoalescent with multiple mergers in the haploid cases, or a Xicoalescent with simultaneous multiple mergers in the polyploid case.
There are two main differences between the Betacoalescent and the standard coalescent. Firstly, the number of lineages that take part in each common ancestor event is random, with distribution determined by moments of the \(Beta(2  \alpha, \alpha)\)distribution. In particular, when there are \(n\) lineages, each set of \(k \leq n\) of them participates in a common ancestor event at rate
\[\frac{1}{B(2  \alpha, \alpha)} \int_0^1 x^{k  \alpha  1} (1  x)^{n  k + \alpha  1} dx,\]where \(B(2  \alpha, \alpha)\) is the Betafunction.
If ploidy = 1, then all participating lineages merge into one common ancestor, corresponding to haploid, singleparent reproduction. If ploidy = \(p > 1\), all participating lineages split randomly into \(2 p\) groups, corresponding to twoparent reproduction with \(p\) copies of each chromosome per parent. All lineages within each group merge simultaneously.
Secondly, the number of generations between common ancestor events predicted by the Betacoalescent is proportional to \(N^{\alpha  1}\), where \(N\) is the population size. Specifically, the mean number of generations until two lineages undergo a common ancestor event is
\[G = \frac{m^{\alpha} N^{\alpha  1}}{\alpha B(2  \alpha, \alpha)},\]if ploidy = 1, and
\[G = \frac{2 p m^{\alpha} (N / 2)^{\alpha  1}} {\alpha B(2  \alpha, \alpha)},\]if ploidy = \(p > 1\), where \(m\) is the mean number of juveniles per family given by
\[m = 2 + \frac{2^{\alpha}}{3^{\alpha  1} (\alpha  1)},\]if ploidy > 1, and
\[m = 1 + \frac{1}{2^{\alpha  1} (\alpha  1)},\]if ploidy = 1.
In the polyploid case we divide the population size \(N\) by two because we assume the \(N\) polyploid individuals form \(N / 2\) twoparent families in which reproduction takes place.
Warning
The number of generations between common ancestor events \(G\) depends both on the population size \(N\) and \(\alpha\), and can be dramatically shorter than in the case of the standard coalescent. For \(\alpha \approx 1\) that is due to insensitivity of \(G\) to \(N\) — see Multiple merger coalescents for an illustration. For \(\alpha \approx 2\), \(G\) is almost linear in \(N\), but can nevertheless be small because \(B(2  \alpha, \alpha) \rightarrow \infty\) as \(\alpha \rightarrow 2\). As a result, population sizes must often be many orders of magnitude larger than census population sizes to obtain realistic amounts of diversity in simulated samples.
See Schweinsberg (2003) for the derivation of the common ancestor event rate, as well as the number of generations between common ancestor events. Note however that Schweinsberg (2003) only covers the haploid case. For details of the diploid extension, see Blath et al. (2013), and Birkner et al. (2018) for a diploid version of the Schweinsberg (2003) model specifically. The general polyploid model is analogous to the diploid case, with \(2 p\) available copies of parental chromsomes per common ancestor event, and hence up to \(2 p\) simultaneous mergers.
 Parameters
alpha (float) – Determines the degree of skewness in the family size distribution, and must satisfy \(1 < \alpha < 2\). Smaller values of \(\alpha\) correspond to greater skewness, and \(\alpha = 2\) would coincide with the standard coalescent.
truncation_point (float) – The maximum number of juveniles \(K\) born to one family as a fraction of the population size \(N\). Must satisfy \(0 < K \leq \inf\). Determines the maximum fraction of the population replaced by offspring in one reproduction event, \(\tau\), via \(\tau = K / (K + m)\), where \(m\) is the mean juvenile number above. The default is \(K = \inf\), which corresponds to the standard Betacoalescent with \(\tau = 1\). When \(K < \inf\), the number of lineages participating in a common ancestor event is determined by moments of the Beta:math:(2  alpha, alpha) distribution conditioned on not exceeding \(\tau\), and the Betafunction in the expression for \(G\) is replaced by the incomplete Betafunction \(B(\tau; 2  \alpha, \alpha)\).

class
msprime.
DiracCoalescent
(*, duration=None, psi=None, c=None)[source]¶ A Lambdacoalescent with multiple mergers in the haploid cases, or a Xicoalescent with simultaneous multiple mergers in the polyploid case.
The Diraccoalescent is an implementation of the model of Blath et al. (2013) The simulation proceeds similarly to the standard coalescent. In addition to binary common ancestor events at rate \(n (n  1) / 2\) when there are \(n\) lineages, potential multiple merger events take place at rate \(c > 0\). Each lineage participates in each multiple merger event independently with probability \(0 < \psi \leq 1\).
If ploidy = 1, then all participating lineages merge into one common ancestor, corresponding to haploid, singleparent reproduction. If ploidy = \(p > 1\), all participating lineages split randomly into \(2 p\) groups, corresponding to twoparent reproduction with \(p\) copies of each chromosome per parent. All lineages within each group merge simultaneously.
Warning
The Diraccoalescent is obtained as a scaling limit of Moran models, rather than WrightFisher models. As a consequence, the number of generations between coalescence events is proportional to \(N^2\), rather than \(N\) generations as in the standard coalescent. See Multiple merger coalescents for an illustration of how this affects simulation output in practice.
 Parameters
c (float) – Determines the rate of potential multiple merger events. We require \(c > 0\).
psi (float) – Determines the fraction of the population replaced by offspring in one large reproduction event, i.e. one reproduction event giving rise to potential multiple mergers when viewed backwards in time. We require \(0 < \psi \leq 1\).

class
msprime.
SweepGenicSelection
(*, duration=None, position=None, start_frequency=None, end_frequency=None, s=None, dt=None)[source]¶ A selective sweep that has occured in the history of the sample. This will lead to a burst of rapid coalescence near the selected site.
The strength of selection during the sweep is determined by the parameter \(s\). Here we define s such that the fitness of the three genotypes at our benefical locus are \(W_{bb}=1\), \(W_{Bb}=1 + s/2\), \(W_{BB}=1 + s\). Thus fitness of the heterozygote is intermediate to the two homozygotes.
The model is one of a structured coalescent where selective backgrounds are defined as in Braverman et al. (1995) The implementation details here follow closely those in discoal (Kern and Schrider, 2016), with the important difference that \(s\) in msprime is half of of that in discoal (i.e. for equivalent results use \(2s\)).
See Selective sweeps for examples and details on how to specify different types of sweeps.
Warning
Currently models with more than one population and a selective sweep are not implemented. Population size changes during the sweep are not yet possible in msprime.
 Parameters
position (float) – the location of the beneficial allele along the chromosome.
start_frequency (float) – population frequency of the benefical allele at the start of the selective sweep. E.g., for a de novo allele in a diploid population of size N, start frequency would be \(1/2N\).
end_frequency (float) – population frequency of the beneficial allele at the end of the selective sweep.
s (float) – \(s\) is the selection coefficient of the beneficial mutation.
dt (float) – dt is the small increment of time for stepping through the sweep phase of the model. a good rule of thumb is for this to be approximately \(1/40N\) or smaller.
Mutations¶

msprime.
sim_mutations
(tree_sequence, rate=None, *, random_seed=None, model=None, start_time=None, end_time=None, discrete_genome=None, keep=None)[source]¶ Simulates mutations on the specified ancestry and returns the resulting
tskit.TreeSequence
. Mutations are generated at the specified rate per unit of sequence length, per unit of time. By default, mutations are generated at discrete sites along the genome and multiple mutations can occur at any given site. A continuous sequence, infinitesites model can also be specified by setting thediscrete_genome
parameter to False.If the
model
parameter is specified, this determines the model of sequence evolution under which mutations are generated. The default mutation model is themsprime.JC69
, a symmetrical mutation model among the ACGT alleles. See the Models section for details of available models.If a
random_seed
is specified, this is used to seed the random number generator. If the same seed is specified and all other parameters are equal then the same mutations will be generated. If no random seed is specified then one is generated automatically.The time interval over which mutations can occur may be controlled using the
start_time
andend_time
parameters. Thestart_time
defines the lower bound (in timeago) on this interval andmax_time
the upper bound. Note that we may have mutations associated with nodes with time <=start_time
since mutations store the node at the bottom (i.e., towards the leaves) of the branch that they occur on.If the tree sequence already has mutations, these are by default retained, but can be discarded by passing
keep=False
. However, adding new mutations to a tree sequence with existing mutations must be done with caution, since it can lead to incorrect or nonsensical results if mutation probabilities differ by ancestral state. (As an extreme example, suppose that X>Y and X>Z are allowable transitions, but Y>Z is not. If a branch already has an X>Y mutation on it, then calling sim_mutations(…, keep=True) might insert an X>Z mutation above the existing mutation, thus implying the impossible chain X>Y>Z.) However, the effect on nucleotide models of mutation are generally very small.Note
Many mutation models will insert silent transitions (e.g., placing a mutation to A above an existing mutation to A). Such mutations are harmless and are required for us to guarantee the statistical properties of the process of sequentially adding mutations to a tree sequence.
 Parameters
tree_sequence (tskit.TreeSequence) – The tree sequence we wish to throw mutations onto.
rate (float) – The rate of mutation per unit of sequence length per unit time, as either a single number (for a uniform rate) or as a
RateMap
. (Default: 0).random_seed (int) – The random seed. If this is None, a random seed will be automatically generated. Valid random seeds must be between 1 and \(2^{32}  1\). See the Random seeds section for usage examples.
model (MutationModel) – The mutation model to use when generating mutations. This can either be a string (e.g.,
"jc69"
) or an instance of a simulation model class e.g,msprime.F84(kappa=0.5)
. If not specified or None, theJC69
mutation model is used. Please see the Models section for more details on specifying mutation models.start_time (float) – The minimum time ago at which a mutation can occur. (Default: no restriction.)
end_time (float) – The maximum time ago at which a mutation can occur (Default: no restriction).
discrete_genome (bool) – Whether to generate mutations at only integer positions along the genome (Default=True).
keep (bool) – Whether to keep existing mutations. (default: True)
 Returns
The
tskit.TreeSequence
object resulting from overlaying mutations on the input tree sequence. Return type
Models¶

class
msprime.
MatrixMutationModel
[source]¶ Superclass of the specific mutation models with a finite set of states. This class can be used to create a matrix mutation model with custom alleles and probabilities. For details of how this works see Mutation Matrix Models Details.
 Parameters
alleles (list[str]) – A list of the possible alleles generated by the model. Each entry is a string.
root_distribution (list[float]) – An array of float values the same length as
alleles
which should sum to 1. These values are used to determine the ancestral state of each mutated site. For example ifalleles
is['A','B']
androot_distribution
is[1,0]
all ancestral states will be'A'
.transition_matrix (list[float]) – A square 2dmatrix of transition probabilities where both dimensions are equal to the length of the
alleles
argument. These are used to determine the derived state at each mutation. Note that rows should sum to 1.

class
msprime.
BinaryMutationModel
[source]¶ Basic binary mutation model with two alleles:
"0"
and"1"
, and ancestral allele always"0"
.This is a
MatrixMutationModel
with alleles["0", "1"]
, root distribution[1.0, 0.0]
, and transition matrix that is either[[0.0, 1.0], [1.0, 0.0]]
(by default), or[[0.5, 0.5], [0.5, 0.5]]
(ifstate_independent=True
). Parameters
state_independent (bool) – Whether mutations will be generated in a way independent of the previous allelic state, which includes silent mutations. See Existing mutations for more details.

class
msprime.
JC69
[source]¶ JukesCantor mutation model (Jukes and Cantor 1969). Based on the standard ACGT nucleotides as alleleic states, this model assumes equal probabilities for ancestral state and equal probabilities for all possible transitions.
This is a
MatrixMutationModel
with alleles["A", "C", "G", "T"]
, root distribution[1/4, 1/4, 1/4, 1/4]
, and default transition matrix[[0, 1/3, 1/3, 1/3], [1/3, 0, 1/3, 1/3], [1/3, 1/3, 0, 1/3], [1/3, 1/3, 1/3, 0]]
. It has no free parameters.If
state_independent
is True, then instead all entries of the transition matrix will be 1/4, so onequarter of the mutations will be silent. See Existing mutations for more details.Citation: Jukes TH, Cantor CR (1969). Evolution of Protein Molecules. pp. 21132.
 Parameters
state_independent (bool) – Whether mutations will be generated in a way independent of the previous allelic state.

class
msprime.
HKY
[source]¶ The Hasegawa, Kishino and Yano mutation model (Hasegawa et al. 1985). Based on the standard ACGT nucleotides as alleleic states, this model allows different rates for transitions and transversions, and sets an equilibrium frequency for each nucleotide. In addition a custom ancestral frequency (
root_distribution
) can be specified. Withkappa=1.0
and the default values of the other arguments this model is equal toJC69
. This model is similar toF84
but with a differing parametrisation forkappa
.This model is parameterized by \(\kappa\) (
kappa
), the ratio of transition to transversion mutation rates, and \(\pi\) (equilibrium_frequencies
), the vector of equilibrium nucleotide frequencies. If this mutation model is used with amutation_rate
of \(\mu\), then the mutation rate from the ith to the jth allele is equal to \(Q_{ij}\), where\(\mathbf{Q} = \frac{\mu}{M} \times \begin{bmatrix} \cdot & \pi_C & \kappa \pi_G & \pi_T \\ \pi_A & \cdot & \pi_G & \kappa \pi_T \\ \kappa \pi_A & \pi_C & \cdot & \pi_T \\ \pi_A & \kappa \pi_C & \pi_G & \cdot \end{bmatrix}\)
Here \(M\) is an overall scaling factor on the mutation rate that can be computed as follows: let \(q_{ij} = \pi_j\) if i<>j is a transversion, and \(q_{ij} = \kappa \pi_j\) otherwise. Set \(q_{ii} = 0\), and then let \(M\) be the largest row sum of \(q\), i.e., \(M = \max_i \left( \sum_j q_{ij} \right)\). Then, this implementation as a
MatrixMutationModel
has the transition matrix \(P_{ij} = q_{ij} / M\), and \(P_{ii} = 1  \sum_{j\neq i} P_{ij}\).Note also that \(\kappa\) is the ratio of individual mutation rates, not the ratio of total transition to transversion rates, which would be \(\kappa/2\), as is used in some parameterizations.
If
state_independent
is true, then the above is modified by setting \(q_{ii} = \kappa \pi_i\). This makes the model completely stateindependent only if \(\kappa = 1\) by adding additional silent mutations. See Existing mutations for more details.Citation: Hasegawa M, Kishino H, Yano T (1985). “Dating of the humanape splitting by a molecular clock of mitochondrial DNA”. Journal of Molecular Evolution. 22 (2): 160–74.
Note that this implementation has the root distribution as a separate parameter, although it defaults to the equilibrium distribution. If you set the root distribution to be different than the equilbrium distribution, then you have a nonequilibrium model, and you should make sure that’s what you want.
 Parameters
kappa (float) – Scaling parameter by which the transition matrix is modified for transitions (A<>G, C<>T). For example a value of
0
means that no transitions will occur,1
makes the probability of transitions and transversions equal.equilibrium_frequencies (list[float]) – An array of float values of length 4 which should sum to 1. These values are used to determine equilibrium frequencies (longterm averages) of the alleles. Defaults to [0.25, 0.25, 0.25, 0.25], i.e., all nucleotides equally likely.
root_distribution (list[float]) – An array of float values of length 4 which should sum to 1. These values are used to determine the ancestral state of each mutational site. Defaults to the value of
equilibrium_frequencies
.state_independent (bool) – Whether to include a higher rate of silent mutations that makes the model closer to stateindependent.

class
msprime.
F84
[source]¶ The F84 mutation model (Felsenstein and Churchill, 1996). Based on the standard ACGT nucleotides as alleleic states, this model takes into account transitions and transversions, and sets an equilibrium frequency for each nucleotide. In addition a custom ancestral frequency (
root_distribution
) can be specified. Withkappa=1.0
and the default values of the other arguments this model is equal toJC69
. This model is similar toHKY
but with a differing parametrisation forkappa
.This model is parameterized by \(\kappa\) (
kappa
), the ratio of transition to transversion mutation rates, and \(\pi\) (equilibrium_frequencies
), the vector of equilibrium nucleotide frequencies. If this mutation model is used with amutation_rate
of \(\mu\), then the mutation rate from the ith to the jth allele is equal to \(Q_{ij}\), where\(\mathbf{Q} = \frac{\mu}{M} \times \begin{bmatrix} \cdot & \pi_C & \left(1 + \frac{\kappa1}{\pi_A + \pi_G}\right) \pi_G & \pi_T \\ \pi_A & \cdot & \pi_G & \left(1 + \frac{\kappa1}{\pi_C + \pi_T}\right) \pi_T \\ \left(1 + \frac{\kappa1}{\pi_A + \pi_G}\right) \pi_A & \pi_C & \cdot & \pi_T \\ \pi_A & \left(1 + \frac{\kappa1}{\pi_C + \pi_T}\right) \pi_C & \pi_G & \cdot \end{bmatrix}\)
Here \(M\) is an overall scaling factor on the mutation rate that can be computed as follows: let \(q_{ij} = \pi_j\) if i<>j is a transversion, \(q_{ij} = (1 + (\kappa1)/(\pi_A + \pi_G)) \pi_j\) if i and j are A and G in some order, \(q_{ij} = (1 + (\kappa1)/(\pi_C + \pi_T)) \pi_j\) if i and j are C and T in some order, and \(q_{ii} = 0\). Let \(M\) be the largest row sum of \(q\), i.e., \(M = \max_i \left( \sum_j q_{ij} \right)\). Then, this implementation as a
MatrixMutationModel
has the transition matrix \(P_{ij} = q_{ij} / M\), and \(P_{ii} = 1  \sum_{j\neq i} P_{ij}\).If
state_independent
is true, then the above is modified by setting \(q_{ii} = \kappa \pi_i\). This makes the model completely stateindependent only if \(\kappa = 1\) by adding additional silent mutations. See Existing mutations for more details.Citation: Felsenstein J, Churchill GA (January 1996). “A Hidden Markov Model approach to variation among sites in rate of evolution”. Molecular Biology and Evolution. 13 (1): 93–104.
Note that this implementation has the root distribution as a separate parameter, although it defaults to the equilibrium distribution. If you set the root distribution to be different than the equilbrium distribution, then you have a nonequilibrium model, and you should make sure that’s what you want.
 Parameters
kappa (float) – Scaling parameter by which the transition matrix is modified for transitions (A<>G, C<>T). Must be greater than or equal to the larger of \((\pi_A+\pi_G)/(\pi_C+\pi_T)\) and \((\pi_C+\pi_T)/(\pi_A+\pi_G)\).
equilibrium_frequencies (list[float]) – An array of float values of length 4 which should sum to 1. These values are used to determine equilibrium frequencies (longterm averages) of the alleles. Defaults to [0.25, 0.25, 0.25, 0.25], i.e all nucleotides equally likely.
root_distribution (list[float]) – An array of float values of length 4 which should sum to 1. These values are used to determine the ancestral state of each mutational site. Defaults to the value of
equilibrium_frequencies
.state_independent (bool) – Whether to include a higher rate of silent mutations that makes the model closer to stateindependent.

class
msprime.
GTR
[source]¶ The Generalised TimeReversible nucleotide mutation model, a general parameterization of a timereversible mutation process (Tavaré et al. 1986). It allows specification of pernucleotide equilibrium frequencies and equilibrium transition rates.
This model is parameterized by the vector \(r\) (
relative_rates
), and \(\pi\) (equilibrium_frequencies
), the vector of equilibrium nucleotide frequencies. The entries ofrelative_rates
are, in this order, \((r_{AC}, r_{AG}, r_{AT}, r_{CG}, r_{CT}, r_{GT})\). If this mutation model is used with amutation_rate
of \(\mu\), then the mutation rate from the ith to the jth allele is \(Q_{ij}\), where\(\mathbf{Q} = \frac{\mu}{M} \times \begin{bmatrix} \cdot & r_{AC} \pi_C & r_{AG} \pi_G & r_{AT} \pi_T \\ r_{AC} \pi_A & \cdot & r_{CG} \pi_G & r_{CT} \pi_T \\ r_{AG} \pi_A & r_{CG} \pi_C & \cdot & r_{GT} \pi_T \\ r_{AT} \pi_A & r_{CT} \pi_C & r_{GT} \pi_G & \cdot \end{bmatrix}\)
Here \(M\) is an overall scaling factor on the mutation rate that can be computed as follows: let \(q_{ij} = r_{ij} \pi_j\) and \(q_{ii} = 0\), and then let \(M\) be the largest row sum of \(q\), i.e., \(M = \max_i \left( \sum_j q_{ij} \right)\). Then, this implementation as a
MatrixMutationModel
has the transition matrix \(P_{ij} = q_{ij} / M\), and \(P_{ii}\) chosen so rows sum to one.If
state_independent
is true, then the above is modified by setting \(q_{ii} = \pi_i\). This makes the model completely stateindependent only if all \(r_{ij} = 1\) by adding additional silent mutations. See Existing mutations for more details.Citation: Tavaré S (1986). “Some Probabilistic and Statistical Problems in the Analysis of DNA Sequences”. Lectures on Mathematics in the Life Sciences. 17: 57–86.
Note that this implementation has the root distribution as a separate parameter, although it defaults to the equilibrium distribution. If you set the root distribution to be different than the equilbrium distribution, then you have a nonequilibrium model, and you should make sure that’s what you want.
 Parameters
relative_rates (list[float]) – Controls the relative rates of mutation between each pair of nucleotides, in order “A<>C”, “A<>G”, “A<>T”, “C<>G”, “C<>T”, and “G<>T”.
equilibrium_frequencies (list[float]) – The equilibrium base frequencies, a list of four probabilities that sum to one. (Default: equal frequencies.)
root_distribution (list[float]) – An array of float values of length 4 which should sum to 1. These values are used to determine the ancestral state of each mutational site. Defaults to the value of
equilibrium_frequencies
.state_independent (bool) – Whether to include a higher rate of silent mutations that makes the model closer to stateindependent.

class
msprime.
BLOSUM62
[source]¶ The BLOSUM62 model of timereversible amino acid mutation. This model has no free parameters.
The model is parameterized by a 20by20 symmetric matrix of relative rates, \(B\), and a vector of amino acid equilibrium frequencies, \(\pi\) (for the precise order, see this model’s .alleles attribute). If this mutation model is used with a
mutation_rate
of \(\mu\), then the mutation rate from the ith to the jth allele is \(Q_{ij}\), where\(Q_{ij} = \frac{\mu}{M} B_{ij} \pi_j,\)
where \(M\) is an overall scaling factor chosen so that the largest row sum in the matrix \(B_{ij} \pi_j /M\) is equal to one. In this model, \(M = 1.726203705809619\). This implementation as a
MatrixMutationModel
has transition matrix \(P_{ij} = B_{ij} \pi_{j} / M\), and \(P_{ii}\) chosen so rows sum to one, and root distribution equal to the equilibrium frequencies, \(\pi\), so the matrix \(B\) can be recovered as follows:model = msprime.BLOSUM62() M = 1.726203705809619 B = M * model.transition_matrix / model.root_distribution
Citation: The values of \(B\) and \(\pi\) used here were copied from SeqGen. The original citation is: Henikoff, S., and J. G. Henikoff. 1992. PNAS USA 89:1091510919, and the numerical values were provided in: Yu,Y.K., Wootton, J.C. and Altschul, S.F. (2003) The compositional adjustment of amino acid substitution matrices. Proc. Natl Acad. Sci., USA, 100, 15688–15693.

class
msprime.
PAM
[source]¶ The PAM model of timereversible amino acid mutation. This model has no free parameters.
The model is parameterized by a 20by20 symmetric matrix of relative rates, \(B\), and a vector of amino acid equilibrium frequencies, \(\pi\) (for the precise order, see this model’s .alleles attribute). If this mutation model is used with a
mutation_rate
of \(\mu\), then the mutation rate from the ith to the jth allele is \(Q_{ij}\), where\(Q_{ij} = \frac{\mu}{M} B_{ij} \pi_j,\)
where \(M\) is an overall scaling factor chosen so that the largest row sum in the matrix \(B_{ij} \pi_j /M\) is equal to one. In this model, \(M = 1.78314862248\). This implementation as a
MatrixMutationModel
has transition matrix \(P_{ij} = B_{ij} \pi_{j} / M\), and \(P_{ii}\) chosen so rows sum to one, and root distribution equal to the equilibrium frequencies, \(\pi\), so the matrix \(B\) can be recovered as follows:model = msprime.PAM() M = 1.78314862248 B = M * model.transition_matrix / model.root_distribution
Citation: The values of \(B\) and \(\pi\) used here were copied from SeqGen, and follow the “Dayhoff DCMut” model as described in Kosiol, C., and Goldman, N. 2005. Different versions of the Dayhoff rate matrix. Molecular Biology and Evolution 22:193199. The original citation is: Dayhoff, M.O., Schwartz, R.M., Orcutt, B.C. (1978). A model of evolutionary change in proteins. Atlas of Protein Sequence Structures, Vol. 5, Suppl. 3, National Biomedical Research Foundation, Washington DC, pp. 345352.

class
msprime.
InfiniteAlleles
[source]¶ An infinite alleles model of mutation in which each allele is a unique positive integer. This works by keeping track of a “next allele”: each time the model is asked to produce a new allele (either for an ancestral or derived state), the “next allele” is provided, and then incremented.
Variables:
next_allele: The next allele to be assigned. This increments by one each time an allele is assigned to a new mutation, and resets to
start_allele
each time a new ancestral state is assigned. Parameters
start_allele (int) – The nonnegative integer to start assigning alleles from. (default: 0)

class
msprime.
SLiMMutationModel
[source]¶ An infinitealleles model of mutation producing “SLiMstyle” mutations.
To agree with mutations produced by SLiM, the ancestral state of each new site is set to the empty string, and each derived state is produced by appending the “next allele” to the previous state. The result is that each allele is a commaseparated string of all mutations that have occurred up to the root.
Mutations produced by SLiM carry both a
time
attribute, in units of “time ago” as usual for tskit, as well as a metadata attributed called “origin_generation”, in units of time since the start of the simulation. Adding these two together  time since the start of the simulation plus time until the end  is equal to the total number of generations of the simulation. The origin_generation is not currently used by SLiM, but for consistency, the origin_genration attribute for mutations produced by this model is set equal toslim_generation
minusfloor(mut.time)
, wheremut.time
is the (tskit) time ago of the mutation. Parameters
type (int) – The nonnegative integer defining the “type” of SLiM mutation that will be recorded in metadata.
next_id (int) – The nonnegative integer to start assigning alleles from. (default: 0)
slim_generation (int) – The “SLiM generation” time used in determining the “origin_generation” metadata attribute of mutations. This can usually be left at its default, which is 1.
block_size (int) – The block size for allocating derived states. You do not need to change this unless you get an “out of memory” error due to a very large number of stacked mutations.
Node flags¶
In the tskit Node Table node flags specify
particular properties about nodes. Msprime follows the standard approach
of setting the tskit.NODE_IS_SAMPLE
flag for all sample nodes,
with all other nodes having a flags value of 0.
Msprime defines some extra flags that help us to identify particular nodes in some situations:

msprime.
NODE_IS_RE_EVENT
¶ The node is an ARG recombination event. Each recombination event is marked with two nodes, one identifying the individual providing the genetic material to the left of the breakpoint and the other providing the genetic material the right. Only present if the
record_full_arg
option is specified.

msprime.
NODE_IS_CA_EVENT
¶ The node is an ARG common ancestor event that did not result in marginal coalescence. Only present if the
record_full_arg
option is specified.

msprime.
NODE_IS_MIG_EVENT
¶ The node is an ARG migration event identifying the individual that migrated. Can be used in combination with the
record_migrations
option. Only present if therecord_full_arg
option is specified.

msprime.
NODE_IS_CEN_EVENT
¶ The node was created by a census event. Please see the Census events section for more details.
Rate maps¶

class
msprime.
RateMap
(*, position, rate)[source]¶ A class mapping a nonnegative rate value to a set of nonoverlapping intervals along the genome. Intervals for which the rate is unknown (i.e., missing data) are encoded by NaN values in the
rate
array. Parameters

property
left
¶ The left position of each interval (inclusive).

property
right
¶ The right position of each interval (exclusive).

property
mid
¶ Returns the midpoint of each interval.

property
span
¶ Returns the span (i.e.,
right  left
) of each of the intervals.

property
position
¶ The breakpoint positions between intervals. This is equal to the
left
array with thesequence_length
appended.

property
rate
¶ The rate associated with each interval. Missing data is encoded by NaN values.

property
mass
¶ The “mass” of each interval, defined as the
rate
\(\times\)span
. This is NaN for intervals containing missing data.

property
missing
¶ A boolean array encoding whether each interval contains missing data. Equivalent to
np.isnan(rate_map.rate)

property
non_missing
¶ A boolean array encoding whether each interval contains nonmissing data. Equivalent to
np.logical_not(np.isnan(rate_map.rate))

property
num_intervals
¶ The total number of intervals in this map. Equal to
num_missing_intervals
+num_non_missing_intervals
.

property
num_missing_intervals
¶ Returns the number of missing intervals, i.e., those in which the
rate
value is a NaN.

property
num_non_missing_intervals
¶ The number of non missing intervals, i.e., those in which the
rate
value is not a NaN.

property
sequence_length
¶ The sequence length covered by this map

property
total_mass
¶ The cumulative total mass over the entire map.

property
mean_rate
¶ The mean rate over this map weighted by the span covered by each rate. Unknown intervals are excluded.

get_rate
(x)[source]¶ Return the rate at the specified list of positions.
Note
This function will return a NaN value for any positions that contain missing data.
 Parameters
x (numpy.ndarray) – The positions for which to return values.
 Returns
An array of rates, the same length as
x
. Return type

get_cumulative_mass
(x)[source]¶ Return the cumulative mass of the map up to (but not including) a given point for a list of positions along the map. This is equal to the integral of the rate from 0 to the point.
 Parameters
x (numpy.ndarray) – The positions for which to return values.
 Returns
An array of cumulative mass values, the same length as
x
 Return type

find_index
(x)[source]¶ Returns the index of the interval that the specified position falls within, such that
rate_map.left[index] <= x < self.rate_map.right[index]
.

missing_intervals
()[source]¶ Returns the left and right coordinates of the intervals containing missing data in this map as a 2D numpy array with shape (
num_missing_intervals
, 2). Each row of this returned array is therefore aleft
,right
tuple corresponding to the coordinates of the missing intervals. Returns
A numpy array of the coordinates of intervals containing missing data.
 Return type

slice
(left=None, right=None, *, trim=False)[source]¶ Returns a subset of this rate map in the specified interval.
 Parameters
left (float) – The left coordinate (inclusive) of the region to keep. If
None
, defaults to 0.right (float) – The right coordinate (exclusive) of the region to keep. If
None
, defaults to the sequence length.trim (bool) – If True, remove the flanking regions such that the sequence length of the new rate map is
right
left
. IfFalse
(default), do not change the coordinate system and mark the flanking regions as “unknown”.
 Returns
A new RateMap instance
 Return type

static
read_hapmap
(fileobj, sequence_length=None, *, has_header=True, position_col=None, rate_col=None, map_col=None)[source]¶ Parses the specified file in HapMap format and returns a RateMap. HapMap files must whitespacedelimited, and by default are assumed to contain a single header line (which is ignored). Each subsequent line then contains a physical position (in base pairs) and either a genetic map position (in centiMorgans) or a recombination rate (in centiMorgans per megabase); the rate between the current physical position (inclusive) and the physical position on the next line (exclusive) is taken as constant. By default, the second column of the file is taken as the physical position and the fourth column is taken as the genetic position, as seen in the following sample of the format:
Chromosome Position(bp) Rate(cM/Mb) Map(cM) chr10 48232 0.1614 0.002664 chr10 48486 0.1589 0.002705 chr10 50009 0.159 0.002947 chr10 52147 0.1574 0.003287 ... chr10 133762002 3.358 181.129345 chr10 133766368 0.000 181.144008
Note
The rows are all assumed to come from the same contig, and the first column is currently ignored. Therefore if you have a single file containing several contigs or chromosomes, you must must split it up into multiple files, and pass each one separately to this function.
 Parameters
fileobj (str) – Filename or file to read. This is passed directly to
numpy.loadtxt()
, so if the filename extension is .gz or .bz2, the file is decompressed firstsequence_length (float) – The total length of the map. If
None
, then assume it is the last physical position listed in the file. Otherwise it must be greater then or equal to the last physical position in the file, and the region between the last physical position and the sequence_length is padded with a rate of zero.has_header (bool) – If True (default), assume the file has a header row and ignore the first line of the file.
position_col (int) – The zerobased index of the column in the file specifying the physical position in base pairs. If
None
(default) assume an index of 1 (i.e. the second column).rate_col (int) – The zerobased index of the column in the file specifying the rate in cM/Mb. If
None
(default) do not use the rate column, but calculate rates using the genetic map positions, as specified inmap_col
. If the rate column is used, the interval from 0 to first physical position in the file is marked as unknown, and the last value in the rate column must be zero.map_col (int) – The zerobased index of the column in the file specifying the genetic map position in centiMorgans. If
None
(default), assume an index of 3 (i.e. the fourth column). If the first genetic position is 0 the interval from position 0 to the first physical position in the file is marked as unknown. Otherwise, act as if an additional row, specifying physical position 0 and genetic position 0, exists at the start of the file.
 Returns
A RateMap object.
 Return type
Demography¶

class
msprime.
Demography
[source]¶ The definition of a demographic model for an msprime simulation, consisting of a set of populations, a migration matrix, and a list of demographic events. See the Demographic models section for detailed documentation on how to define, debug and simulate with demography in msprime.
Please see the Demography objects section for details of how to access and update population information within a model.
Demography objects implement the Python
collections.abc.Mapping
protocol, in which the keys are either the populationname
or integerid
values (see the Identifying populations section for more information) and the values arePopulation
objects.In general, population references in methods such as
Demography.add_population_split()
can either be string names or integer IDs, and the two forms can be used interchangeably.
add_population
(*, initial_size, growth_rate=None, name=None, description=None, extra_metadata=None, default_sampling_time=None, initially_active=None)[source]¶ Adds a new
Population
to thisDemography
with the specified parameters. The new population will have ID equal to the the number of populations immediately beforeadd_population
is called, such that the first population added has ID 0, the next ID 1 and so on. If thename
is not specified, this defaults to"pop_{id}"
. An Initial size value must be specified (but may be zero). Parameters
initial_size (float) – The size of the population at time zero. See the Initial size section for more details and examples.
growth_rate (float) – The exponential growth rate of the population. See the Growth rate section for more details and examples.
name (str) – The humanreadable identifier for this population. If not specified, defaults to the string
"pop_{id}"
whereid
is the population’s integer ID. See Identifying populations for more details and recommendations on best practise.description (str) – A concise but informative description of what this population represents within the wider model. Defaults to the empty strings.
extra_metadata (dict) – Extra metadata to associate with this population that will be stored tree sequences output by
sim_ancestry()
. See the Metadata section for more details and examples.default_sampling_time (float) – The time at which samples will be taken from this population, if a time in not otherwise specified. By default this is determined by the details of the model, and whether populations are ancestral in Population split events. See the Default sampling time section for more details.
initially_active (bool) – Whether this population is initially active. By default this is determined by the details of the model, and whether populations are ancestral in Population split events. See the Life cycle section for more details.
 Returns
The new
Population
instance. Return type

set_migration_rate
(source, dest, rate)[source]¶ Sets the backwardstime rate of migration from the specified
source
population todest
to the specified value. This has the effect of settingdemography.migration_matrix[source, dest] = rate
. It is the rate at which a lineage currently insource
moves todest
as one follows the lineage back through time.Important
Note this is the backwards in time; migration rate and that
source
anddest
are from the perspective of lineages in the coalescent process. See Migration for more details and clarification on this vital point.The
source
anddest
populations can be referred to either by their integerid
or stringname
values.

set_symmetric_migration_rate
(populations, rate)[source]¶ Sets the symmetric migration rate between all pairs of populations in the specified list to the specified value. For a given pair of population IDs
j
andk
, this setsdemography.migration_matrix[j, k] = rate
anddemography.migration_matrix[k, j] = rate
.Populations may be specified either by their integer IDs or by their string names.

add_population_split
(time, *, derived, ancestral)[source]¶ Adds a population split event at the specified time. In a population split event all lineages from the (more recent) derived populations move to the (more ancient) ancestral population. Forwards in time, this corresponds to the ancestral population splitting into the derived populations.
See the Population split section for more details and examples.
In addition to moving lineages from the derived population(s) into the ancestral population, a population split has the following additional effects:
All derived populations are set to inactive.
All migration rates to and from the derived populations are set to 0.
Population sizes and growth rates for the derived populations are set to 0.
The
default_sampling_time
of theancestral
Population
is set to the time of this event, if thedefault_sampling_time
for the ancestral population has not already been set.

add_admixture
(time, *, derived, ancestral, proportions)[source]¶ Adds an admixture event at the specified time. In an admixture event all lineages from a (more recent)
derived
population move to a list of (more ancient)ancestral
populations according to a list ofproportions
, such that a given lineage has a probabilityproportions[j]
of being moved to the populationancestral[j]
. This movement of lineages backwards in time corresponds to the initial state of the admixed derived population the specifiedtime
being composed of individuals from the specifiedancestral
populations in the specifiedproportions
.See the Admixture section for more details and examples.
In addition to moving lineages from the derived population into the ancestral population(s), an admixture has the following additional effects:
The derived population is set to inactive.
The ancestral populations are set to active, if they are not already active.
All migration rates to and from the derived population are set to 0.
Population sizes and growth rates for the derived population are set to 0, and the poulation is marked as inactive.

add_mass_migration
(time, *, source, dest, proportion)[source]¶ Adds a mass migration (or “pulse migration”) event at the specified time. In a mass migration event, lineages in the
source
population are moved to thedest
population with probabilityproportion
. Forwardsintime, this corresponds to individuals migrating from populationdest
to populationsource
.Please see the Pulse (mass) migration section for more details and examples.
Warning
Mass migrations are an advanced feature and should only be used if the required population dynamics cannot be modelled by Population split or Admixture events.
Important
Note that
source
anddest
are from the perspective of the coalescent process, i.e. backwards in time; please see the Migration section for more details. Parameters
time (float) – The time at which this event occurs in generations.
int source (str,) – The population from which lineages are moved.
int dest (str,) – The population to which lineages are moved.
proportion (float) – For each lineage in the
source
population, this is the probability that it moves to thedest
population.

add_migration_rate_change
(time, *, rate, source=None, dest=None)[source]¶ Changes the rate of migration from one deme to another to a new value at a specific time. Migration rates are specified in terms of the rate at which lineages move from population
source
todest
during the progress of the simulation.Important
Note that
source
anddest
are from the perspective of the coalescent process, i.e. backwards in time; please see the Migration section for more details.By default,
source=None
anddest=None
, which results in all nondiagonal elements of the migration matrix being changed to the new rate. Ifsource
anddest
are specified, they must refer to valid populations (either integer IDs or string names).

add_symmetric_migration_rate_change
(time, populations, rate)[source]¶ Sets the symmetric migration rate between all pairs of populations in the specified list to the specified value. For a given pair of population IDs
j
andk
, this setsmigration_matrix[j, k] = rate
andmigration_matrix[k, j] = rate
.Please see the Migration section for more details.
Populations may be specified either by their integer IDs or by their string names.

add_population_parameters_change
(time, *, initial_size=None, growth_rate=None, population=None)[source]¶ Changes the size parameters of a population (or all populations) at a given time.
Please see the Populations section for more details.
 Parameters
time (float) – The length of time ago at which this event occurred.
initial_size (float) – The absolute size of the population at the beginning of the time slice starting at
time
. If None, the initial_size of the population is computed according to the initial population size and growth rate over the preceding time slice.growth_rate (float) – The new pergeneration growth rate. If None, the growth rate is not changed. Defaults to None.
int population (str,) – The ID of the population affected. If
population
is None, the changes affect all populations simultaneously.

add_simple_bottleneck
(time, population, proportion=None)[source]¶ Adds a population bottleneck at the specified time in which each lineage has probablility equal to
proportion
of coalescing into a single ancestor.Please see the Simple bottleneck section for more details.

add_instantaneous_bottleneck
(time, *, population, strength)[source]¶ Adds a bottleneck at the specified time in the specified population that is equivalent to the coalescent process running for
strength
generations.Please see the Instantaneous bottleneck section for more details.
Note
The ploidy is also use to scale the time scale of the coalescent process during the bottleneck.

add_census
(time)[source]¶ Adds a “census” event at the specified time. In a census we add a node to each branch of every tree, thus recording the population that each lineage is in at the specified time.
This may be used to record all ancestral haplotypes present at that time, and to extract other information related to these haplotypes: for instance to trace the local ancestry of a sample back to a set of contemporaneous ancestors, or to assess whether a subset of samples has coalesced more recently than the census time.
See Census events for more details.
Warning
When used in the conjunction with the DTWF model noninteger census times should be used to guarantee that the census nodes don’t coincide with coalescences (and therefore zero branch length errors). See Using the DTWF model for more details.
 Parameters
time (float) – The time at which the census should occur.

validate
()[source]¶ Checks the demography looks sensible and raises errors/warnings appropriately, and return a copy in which all default values have been appropriately resolved.

copy
(populations=None)[source]¶ Returns a copy of this model. If the
populations
argument is specified, the populations in the copied model will be in this order. Parameters
populations (list) – A list of population identifiers defining the order of the populations in the new model. If not specified, the current order is used.
 Returns
A copy of this Demography.

debug
()[source]¶ Returns a
DemographyDebugger
instance for this demography. Returns
A DemographyDebugger object for this demography.
 Return type

assert_equal
(other)[source]¶ Compares this Demography with specified
other
and raises an AssertionError if they are not exactly equal. Parameters
other (Demography) – The other demography to compare against.

is_equivalent
(other, rel_tol=None, abs_tol=None)[source]¶ Compares this demography with the other and return True if they are equivalent up to the specified numerical tolerances. Two demographies are equivalent if, they have the same set of epochs defined by demographic events, and for each epoch:
The population’s
initial_size
,growth_rate
andactive
values are equal in all populations.The migration matrices are equal
The same sequence of lineage movements through population splits, etc.
All numerical comparisons are performed using
math.isclose()
. Parameters
other (Demography) – The other demography to compare against.
rel_tol (float) – The relative tolerance used by math.isclose.
abs_tol (float) – The relative tolerance used by math.isclose.
 Returns
True if this demography and other are equivalent up to numerical tolerances.
 Rtype bool
bool

static
from_species_tree
(tree, initial_size, *, time_units='gen', generation_time=None, growth_rate=None)[source]¶ Parse a species tree in Newick format and return the corresponding
Demography
object. The tree is assumed to be rooted and ultrametric and branch lengths must be included and correspond to time, either in units of millions of years, years, or generations.The returned
Demography
object contains aPopulation
for each node in the species tree. The population’sname
attribute will be either the corresponding node label from the newick tree, if it exists, or otherwise the name takes the form “pop_{j}”, where j is the position of the given population in the list. Leaf populations are first in the list, and added in lefttoright order. Populations corresponding to the internal nodes are then added in a postorder traversal of the species tree. For each internal node a Population split event is added so that lineages move from its child populations at the appropriate time and rates of continuous migration to and from the child populations is set to zero. See the Population split section for more details.The initial sizes and growth rates for the populations in the model are set via the
initial_size
andgrowth_rate
arguments. These can be specified in two ways: if a single number is provided, this is used for all populations. The argument may also be a mapping from population names to their respective values. For example:tree = "(A:10.0,B:10.0)C" initial_size = {"A": 1000, "B": 2000, "C": 100} demography = msprime.Demography.from_species_tree(tree, initial_size)
Note that it is possible to have default population sizes for unnamed ancestral populations using a
collections.defaultdict
, e.g.,tree = "(A:10.0,B:10.0)" initial_size = collections.defaultdict(lambda: 100) initial_size.update({"A": 1000, "B": 2000}) demography = msprime.Demography.from_species_tree(tree, initial_size)
 Parameters
tree (str) – The tree string in Newick format, with named leaves and branch lengths.
initial_size – Each population’s initial_size. May be a single number or a mapping from population names to their sizes.
growth_rate – Each population’s growth_rate. May be a single number or a mapping from population names to their exponential growth rates. Defaults to zero.
time_units (str) – The units of time in which the species tree’s branch lengths are measured. Allowed branch length units are millions of years, years, and generations; these should be specified with the strings
"myr"
,"yr"
, or"gen"
, respectively. This defaults to"gen"
.generation_time (float) – The number of years per generation. If and only if the branch lengths are not in units of generations, the generation time must be specified. This defaults to None.
 Returns
A Demography object representing the specified species tree.
 Return type

static
from_starbeast
(tree, generation_time, time_units='myr')[source]¶ Parse a species tree produced by the program TreeAnnotator based on a posterior tree distribution generated with StarBEAST and return the corresponding Demography object.
Species trees produced by TreeAnnotator are written in Nexus format and are rooted, bifurcating, and ultrametric. Branch lengths usually are in units of millions of years, but the use of other units is permitted by StarBEAST (and thus TreeAnnotator). This function allows branch length units of millions of years or years. Leaves must be named and the tree must include information on population sizes of leaf and ancestral species in the form of annotation with the “dmv” tag, which is the case for trees written by TreeAnnotator based on StarBEAST posterior tree distributions.
The returned
Demography
object contains aPopulation
for each node in the species tree. The population’sname
attribute will be either the corresponding node label from the newick tree, if it exists, or otherwise the name takes the form “pop_{j}”, where j is the position of the given population in the list. Leaf populations are first in the list, and added in lefttoright order. Populations corresponding to the internal nodes are then added in a postorder traversal of the species tree. For each internal node a Population split event is added so that lineages move from its child populations at the appropriate time and rates of continuous migration to and from the child populations is set to zero. See the Population split section for more details. Parameters
tree (str) – The tree string in Nexus format, with named leaves, branch lengths, and branch annotation. Typically, this string is the entire content of a file written by TreeAnnotator.
generation_time (float) – The number of years per generation.
time_units (str) – The units of time in which the species tree’s branch lengths are measured. Allowed branch length units are millions of years, and years; these should be specified with the strings
"myr"
or"yr"
, respectively. This defaults to"myr"
.
 Returns
A
Demography
instance that describing the information in the specified species tree. Return type

static
from_old_style
(population_configurations=None, *, migration_matrix=None, demographic_events=None, Ne=1, ignore_sample_size=False, population_map=None)[source]¶ Creates a Demography object from the pre 1.0 style input parameters, reproducing the old semantics with respect to default values.
No sample information is stored in the newstyle
Demography
objects, and therefore if thesample_size
attribute of any of the inputPopulationConfiguration
objects is set a ValueError will be raised by default. However, if theignore_sample_size
parameter is set to True, this check will not be performed and the sample sizes specified in the oldstylePopulationConfiguration
objects will be ignored.Please see the Specifying samples section for details on how to specify sample locations in
sim_ancestry()
.Todo
Document the remaining parameters.

static
from_tree_sequence
(ts, initial_size=0)[source]¶ Creates a
Demography
object based on the information in the specifiedtskit.TreeSequence
. The returned demography will contain a population for each of the populations in the tree sequence, in the same order.The metadata for each population in the tree sequence will be inspected. If a schema is present and the metadata can be decoded, the
name
anddescription
properties of populations are set if the corresponding keys are present.If the metadata cannot be decoded, the default values for
name
anddescription
are used.The
initial_size
of each of the new populations is set to zero by default, and all otherPopulation
attributes are set to their default values. It is therefore essential to update theinitial_size
andgrowth_rate
values to reflect the desired demography.See also
See the initial state section for examples of how this method can be used.
 Parameters
ts (tskit.TreeSequence) – The tree sequence to extract population information from.
initial_size (float) – The default initial size for the newly added populations (Default=0).
 Returns
A Demography object representing the populations in the specified tree sequence.
 Return type

static
isolated_model
(initial_size, *, growth_rate=None)[source]¶ Returns a
Demography
object representing a collection of isolated populations with specified initial population sizes and growth rates. Please see Demographic models for more details on population sizes and growth rates. Parameters
initial_size (list) – the
initial_size
value for each of thePopulation
in the returned model. The length of the array corresponds to the number of populations. model.growth_rate (list) – The exponential growth rate for each population. Must be either None (the default, resulting a zero growth rate) or an array with the same length as
initial_size
.
 Returns
A Demography object representing this model, suitable as input to
simulate()
. Return type

static
island_model
(initial_size, migration_rate, *, growth_rate=None)[source]¶ Returns a
Demography
object representing a collection of populations with specified initial population sizes and growth rates, with symmetric migration between each pair of populations at the specified rate. Please see Demographic models for more details on population sizes and growth rates. Parameters
initial_size (list) – the
initial_size
value for each of thePopulation
in the returned model. The length of the array corresponds to the number of populations. model.migration_rate (float) – The migration rate between each pair of populations.
growth_rate (list) – The exponential growth rate for each population. Must be either None (the default, resulting a zero growth rate) or an array with the same length as
initial_size
.
 Returns
A Demography object representing this model, suitable as input to
simulate()
. Return type

static
stepping_stone_model
(initial_size, migration_rate, *, growth_rate=None, boundaries=False)[source]¶ Returns a
Demography
object representing a collection of populations with specified initial population sizes and growth rates, in which adjacent demes exchange migrants at the specified rate. Please see Demographic models for more details on population sizes and growth rates.Note
The current implementation only supports a onedimensional stepping stone model, but higher dimensions could also be supported. Please open an issue on GitHub if this feature would be useful to you.
 Parameters
initial_size (list) – the
initial_size
value for each of thePopulation
in the returned model. The length of the array corresponds to the number of populations.migration_rate (float) – The migration rate between adjacent pairs of populations.
growth_rate (list) – The exponential growth rate for each population. Must be either None (the default, resulting a zero growth rate) or an array with the same length as
initial_size
.boundaries (bool) – If True the stepping stone model has boundary conditions imposed so that demes at either end of the chain do not exchange migrats. If False (the default), the set of populations is “circular” and migration takes place between the terminal demes.
 Returns
A Demography object representing this model, suitable as input to
simulate()
. Return type


class
msprime.
Population
(initial_size=0.0, growth_rate=0.0, name=None, description='', extra_metadata=<factory>, default_sampling_time=None, initially_active=None, id=None)[source]¶ A single population in a
Demography
. See the Populations section for more information on what populations represent, and how they can be used.Warning
This class should not be instantiated directly. Please use
Demography.add_population()
method instead.
initial_size
: float = 0.0¶ The absolute size of the population at time zero. See the Initial size section for more details and examples.

growth_rate
: float = 0.0¶ The exponential growth rate of the population per generation (forwards in time). Growth rates can be negative. This is zero for a constant population size, and positive for a population that has been growing. See the Growth rate section for more details and examples.

name
: Optional[str] = None¶ The name of the population. If specified this must be a uniquely identifying string and must be a valid Python identifier (i.e., could be used as a variable name in Python code). See Identifying populations for more details and recommendations on best practise.

description
: str = ''¶ A concise description of the population. Defaults to the empty string if not specified.

extra_metadata
: dict¶ A JSONencodable dictionary of metadata items to be stored in the associated tskit population object. This dictionary must not contain keys for any of the predefined metadata items. See the Metadata section for more details and examples.

default_sampling_time
: Optional[float] = None¶ The default time at which samples are drawn from this population. See the Default sampling time section for more details.

initially_active
: Optional[bool] = None¶ If True, this population will always be initially active, regardless of whether it participates in a Population split. If not set, or None, the initial state of the population will be set automatically depending on the events declared in the demography. See the Life cycle section for more details.

id
: Optional[int] = None¶ The integer ID of this population within the parent
Demography
. This attribute is assigned by the Demography class and should not be set or changed by user code.


class
msprime.
DemographyDebugger
(Ne=1, population_configurations=None, migration_matrix=None, demographic_events=None, model=None, *, demography=None)[source]¶ Utilities to compute and display information about the state of populations during the different simulation epochs defined by demographic events.
Warning
This class is not intended to be instantiated directly using the contructor  please use
Demography.debug()
to obtain a DemographyDebugger for a givenDemography
instead.
print_history
(output=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf8'>)[source]¶ Prints a summary of the history of the populations.
Deprecated since 1.0: use
print(debugger)
instead.

property
population_size_history
¶ Returns a (num_pops, num_epochs) numpy array giving the starting population size for each population in each epoch.

property
epoch_start_time
¶ The array of epoch start_times defined by the demographic model.

property
num_epochs
¶ Returns the number of epochs defined by the demographic model.

population_size_trajectory
(steps)[source]¶ Return an array of perpopulation population sizes, as defined by the demographic model. These are the initial_size parameters of the model, modified by any population growth rates. The sizes are computed at the time points given by steps.
 Parameters
steps (list) – List of times ago at which the population size will be computed.
 Returns
Returns a numpy array of population sizes, with one column per population, whose [i,j]th entry is the size of population j at time steps[i] ago.

lineage_probabilities
(steps, sample_time=0)[source]¶ Returns an array such that P[j, a, b] is the probability that a lineage that started in population a at time sample_time is in population b at time steps[j] ago.
This function reports sampling probabilities _before_ mass migration events (or other events that move lineages) at a step time, if a mass migration event occurs at one of those times. Migrations will then effect the next time step.
 Parameters
steps (list) – A list of times to compute probabilities.
sample_time – The time of sampling of the lineage. For any times in steps that are more recent than sample_time, the probability of finding the lineage in any population is zero.
 Returns
An array of dimension len(steps) by num pops by num_pops.

possible_lineage_locations
(samples=None)[source]¶ Given the sampling configuration, this function determines when lineages are possibly found within each population over epochs defined by demographic events and sampling times. If no sampling configuration is given, we assume we sample lineages from every population at time zero.
The epoch intervals returned are those in which there are distinct configurations of possible lineage locations, and so the number of returned epochs may be less than the total number of epochs defined by the demography and will depend on the input sample configuration.
The samples are specified by either a list of population identifiers ( integer IDs or string names) or by a list of
SampleSet
objects, allowing sampling times to be specified explicitly. If thetime
field of theSampleSet
is not specified (or population IDs are used) samples are taken at the population’s default_sampling_time. OnlySampleSet
objects withnum_samples > 0
are counted as contributing samples to a particular population.To support legacy code,
Sample
objects from the 0.x API can also provided, although its use is discouraged in new code. Parameters
samples (list) – The populations that we sample from. Can be either a list of population identifiers,
SampleSet
orSample
objects. Returns
Returns a dictionary with epoch intervals as keys whose values are a list with length equal to the number of populations with True and False indicating which populations could possibly contain lineages over that epoch. The epoch intervals are given by tuples: (epoch start, epoch end). The first epoch necessarily starts at time 0, and the final epoch has end time of infinity.

mean_coalescence_time
(lineages, min_pop_size=1, steps=None, rtol=0.005, max_iter=12)[source]¶ Compute the mean time until coalescence between pairs of the specified sample
lineages
. Sample lineages are specified as a mapping from populations to the number of monoploid sample genomes present in that population at time zero. See the Coalescence rates and mean times section for usage examples and more details.Important
This function assumes a diploid model when computing coalescence rates (see the Coalescent time scales section for more information).
The calculation is performed by using
coalescence_rate_trajectory()
to compute the probability that the lineages have not yet coalesced by time t, and using these to approximate \(E[T] = \int_t^\infty P(T > t) dt\), where \(T\) is the coalescence time. Seecoalescence_rate_trajectory()
for more details.To compute this, an adequate time discretization must be arrived at by iteratively extending or refining the current discretization. Debugging information about numerical convergence of this procedure is logged using the Python
logging
infrastructure. The daiquiri module is a convenient way to set up logging, and we can use it to make these messages appear on stderr like this:import daiquiri daiquiri.setup(level="DEBUG") debugger.mean_coalescence_time(1)
Briefly, this outputs iteration number, mean coalescence time, maximum difference in probabilty of not having coalesced yet, difference to last coalescence time, probability of not having coalesced by the final time point, and whether the last iteration was an extension or refinement.
 Parameters
lineages (dict) – A mapping of populations (either integer IDs or string names: see the Identifying populations section for more details) to the number of monoploid sample lineages in that population.
min_pop_size (int) – See
coalescence_rate_trajectory()
.steps (list) – The time discretization to start out with (by default, picks something based on epoch times).
rtol (float) – The relative tolerance to determine mean coalescence time to (used to decide when to stop subdividing the steps).
max_iter (int) – The maximum number of times to subdivide the steps.
 Returns
The mean coalescence time (a number).
 Return type

coalescence_rate_trajectory
(steps, lineages, min_pop_size=1, double_step_validation=True)[source]¶ Calculate the mean coalescence rates and proportions of uncoalesced lineages between the specified sample lineages, at each of the times ago listed by steps, in this demographic model. Sample lineages are specified as a mapping from populations to the number of monoploid sample genomes present in that population at time zero. See the Inverse instantaneous coalescence rates section for usage examples and more details.
The coalescence rate at time t in the past is the average rate of coalescence of asyetuncoalesed lineages, computed as follows: let \(p(t)\) be the probability that the lineages of a randomly chosen pair of samples has not yet coalesced by time \(t\), let \(p(z,t)\) be the probability that the lineages of a randomly chosen pair of samples has not yet coalesced by time \(t\) and are both in population \(z\), and let \(N(z,t)\) be the diploid effective population size of population \(z\) at time \(t\). Then the mean coalescence rate at time \(t\) is \(r(t) = (\sum_z p(z,t) / (2 * N(z,t)) / p(t)\).
The computation is done by approximating population size trajectories with piecewise constant trajectories between each of the steps. For this to be accurate, the distance between the steps must be small enough so that (a) short epochs (e.g., bottlenecks) are not missed, and (b) populations do not change in size too much over that time, if they are growing or shrinking. This function optionally provides a simple check of this approximation by recomputing the coalescence rates on a grid of steps twice as fine and throwing a warning if the resulting values do not match to a relative tolerance of 0.001.
 Parameters
steps (list) – The times ago at which coalescence rates will be computed.
lineages (dict) – A mapping of populations (either integer IDs or string names: see the Identifying populations section for more details) to the number of monoploid sample lineages in that population.
min_pop_size (int) – The smallest allowed population size during computation of coalescent rates (i.e., coalescence rates are actually 1 / (2 * max(min_pop_size, N(z,t))). Spurious very small population sizes can occur in models where populations grow exponentially but are unused before some time in the past, and lead to floating point error. This should be set to a value smaller than the smallest desired population size in the model.
double_step_validation (bool) – Whether to perform the check that step sizes are sufficiently small, as described above. This is highly recommended, and will take at most four times the computation.
 Returns
A tuple of arrays whose jth elements, respectively, are the coalescence rate at the jth time point (denoted r(t[j]) above), and the probablility that a randomly chosen pair of lineages has not yet coalesced (denoted p(t[j]) above).
 Return type

Likelihoods¶

msprime.
log_arg_likelihood
(ts, recombination_rate, Ne=1)[source]¶ Returns the log probability of the stored tree sequence under the Hudson ARG. An exact expression for this probability is given in equation (1) of Kuhner et al. (2000).
We assume branch lengths stored in generations, resulting in a coalescence rate of \(1 / (2 N_e)\) per pair of lineages.
Warning
The stored tree sequence must store the full realisation of the ARG, including all recombination events and all common ancestor events, regardless of whether the recombinations cause a change in the ancestral tree or whether the common ancestor events cause coalescence of ancestral material. See Ancestral recombination graph for details of this data structure, and how to generate them using
msprime
.Warning
This method only supports continuous genomes. See Discrete or continuous? for how these can be specified when simulating tree sequences using
msprime
. Parameters
ts (tskit.TreeSequence) – The tree sequence object.
recombination_rate (float) – The perlink, pergeneration recombination probability. Must be nonnegative.
Ne (float) – The diploid effective population size.
 Returns
The log probability of the tree sequence under the Hudson ancestral recombination graph model. If the recombination rate is zero and the tree sequence contains at least one recombination event, then returns DBL_MAX.

msprime.
log_mutation_likelihood
(ts, mutation_rate)[source]¶ Returns the unnormalised log probability of the stored pattern of mutations on the stored tree sequence, assuming infinite sites mutation. In particular, each stored site must only contain a single mutation, and all mutant alleles are treated interchangeably as having arisen from a single mutation rate. This is automatically true for some msprime mutation models, such as JC69. The omitted normalising constant depends on the pattern of mutations, but not on the tree sequence or the mutation rate.
Warning
The infinite sites assumption means that this method is only valid for continuous genomes. It can also be run on discrete genomes, for which it may provide a reasonable approximation when the number of sites is large and the mutation rate is low. See Discrete or continuous? for how continuous genomes can be specified when simulating tree sequences using msprime.
The function first computes the probability of the overall number of mutations \(M\) from the Poisson probability mass function
\[e^{T \mu / 2} \frac{(T \mu / 2)^M}{M!},\]where \(T\) is the total area of ancestral material in the tree sequence stored in units of generations, and \(\mu\) is the persite, pergeneration mutation probability. Each mutation then contributes an individual factor of \(l / T\), where \(l\) is the total branch length on which the mutation could have arisen while appearing on all of the required lineages, again stored in generations.
Warning
If a tree at the site of a mutation contains unary nodes, then \(l\) could span more than one edge. In particular, we do not constrain mutations to take place on the edge directly above the node on which they have been recorded, but rather on any edge which would yield the same configuration of SNPs at the leaves of the tree sequence.
 Parameters
ts (tskit.TreeSequence) – The tree sequence object with mutations.
mutation_rate (float) – The persite, pergeneration mutation probablity. Must be nonnegative.
 Returns
The unnormalised log probability of the observed SNPs given the tree sequence. If the mutation rate is set to zero and the tree sequence contains at least one mutation, then returns float(“inf”).