API Reference#
This lists the detailed reference documentation for the msprime Python API. The reference documentation aims to be concise, precise and exhaustive; as such, it is not the best place to start if you are new to a particular piece of functionality. Please see the top-level documentation for Ancestry simulations, Mutation simulations or Demographic models for discussion and examples of individual features.
Summary#
Ancestry#
|
Simulates an ancestral process described by the specified model, demography and samples, and return a |
|
Specify a set of exchangeable sample individuals with a given ploidy value from a population at a given time. |
|
The classical coalescent with recombination model (i.e., Hudson's algorithm). |
|
The Sequentially Markov Coalescent (SMC) model defined by McVean and Cardin (2005). |
|
The SMC' model defined by Marjoram and Wall (2006) as a refinement of the |
|
A discrete backwards-time Wright-Fisher model, with diploid back-and-forth recombination. |
|
Backwards-time simulations through a pre-specified pedigree, with diploid individuals and back-and-forth recombination. |
|
A Lambda-coalescent with multiple mergers in the haploid cases, or a Xi-coalescent with simultaneous multiple mergers in the polyploid case. |
|
A Lambda-coalescent with multiple mergers in the haploid cases, or a Xi-coalescent with simultaneous multiple mergers in the polyploid case. |
|
A selective sweep that has occurred in the history of the sample. |
Mutations#
|
Simulates mutations on the specified ancestry and returns the resulting |
|
Jukes-Cantor mutation model (Jukes and Cantor 1969). |
|
The Hasegawa, Kishino and Yano mutation model (Hasegawa et al. 1985). |
|
The F84 mutation model (Felsenstein and Churchill, 1996). |
|
The Generalised Time-Reversible nucleotide mutation model, a general parameterisation of a time-reversible mutation process (Tavaré et al. 1986). |
|
The BLOSUM62 model of time-reversible amino acid mutation. |
|
The PAM model of time-reversible amino acid mutation. |
|
Basic binary mutation model with two alleles: |
Superclass of the specific mutation models with a finite set of states. |
|
|
General class for Microsatellite mutation models which parameterizes the transition matrix describing changes among alleles using five parameters according to the model of Sainudiin et al. (2004). |
|
Stepwise mutation model (Ohta and Kimura, 1978), for microsatellite repeat number. |
|
Two-phase mutation model of DiRienzo et al. (1994). |
|
|
An infinite alleles model of mutation in which each allele is a unique positive integer. |
|
An infinite-alleles model of mutation producing "SLiM-style" mutations. |
Demography#
|
A single population in a |
|
The definition of a demographic model for an msprime simulation, consisting of a set of populations, a migration matrix, and a list of demographic events. |
|
Utilities to compute and display information about the state of populations during the different simulation epochs defined by demographic events. |
Utilities#
|
A class mapping a non-negative rate value to a set of non-overlapping intervals along the genome. |
|
Utility for building pedigrees in the format required for input to the |
|
Parse a text file describing a pedigree used for input to the |
Reference documentation#
Ancestry#
- msprime.sim_ancestry(samples=None, *, demography=None, sequence_length=None, discrete_genome=None, recombination_rate=None, gene_conversion_rate=None, gene_conversion_tract_length=None, population_size=None, ploidy=None, model=None, initial_state=None, start_time=None, end_time=None, record_migrations=None, record_full_arg=None, additional_nodes=None, coalescing_segments_only=None, num_labels=None, random_seed=None, num_replicates=None, replicate_index=None, record_provenance=None)[source]#
Simulates an ancestral process described by the specified model, demography and samples, and return a
tskit.TreeSequence
(or a sequence of replicate tree sequences).- Parameters:
samples – The sampled individuals as either an integer, specifying the number of individuals to sample in a single-population model; or a list of
SampleSet
objects defining the properties of groups of similar samples; or as a mapping in which the keys are population identifiers (either an integer ID or string name) and the values are the number of samples to take from the corresponding population at its default sampling time. It is important to note that samples correspond to individuals here, and each sampled individual is usually associated with \(k\) sample nodes (or genomes) whenploidy
= \(k\). See the Specifying samples section for further details. Eithersamples
orinitial_state
must be specified.demography – The demographic model to simulate, describing the extant and ancestral populations, their population sizes and growth rates, their migration rates, and demographic events affecting the populations over time. See the Demographic models section for details on how to specify demographic models and Specifying samples for details on how to specify the populations that samples are drawn from. If not specified (or None) we default to a single population with constant size 1 (see also the
population_size
parameter).ploidy (int) – The number of monoploid genomes per sample individual (Default=2). See the Ploidy section for usage examples.
sequence_length (float) – The length of the genome sequence to simulate. See the Sequence length section for usage examples for this parameter and how it interacts with other parameters.
discrete_genome (bool) – If True (the default) simulation occurs in discrete genome coordinates such that recombination and gene conversion breakpoints always occur at integer positions. Thus, multiple (e.g.) recombinations can occur at the same genome position. If
discrete_genome
is False simulations are performed using continuous genome coordinates. In this case multiple events at precisely the same genome location are very unlikely (but technically possible). See the Discrete or continuous? section for usage examples.recombination_rate – The rate of recombination along the sequence; can be either a single value (specifying a single rate over the entire sequence) or an instance of
RateMap
. See the Recombination section for usage examples for this parameter and how it interacts with other parameters.gene_conversion_rate – The rate of gene conversion along the sequence. If provided, a value for
gene_conversion_tract_length
must also be specified. See the Gene conversion section for usage examples for this parameter and how it interacts with other parameters.gene_conversion_tract_length – The mean length of the gene conversion tracts. For discrete genomes the tract lengths are geometrically distributed with mean
gene_conversion_tract_length
, which must be greater than or equal to 1. For continuous genomes the tract lengths are exponentially distributed with meangene_conversion_tract_length
, which must be larger than 0.population_size – The number of individuals of the default single population
Demography
. If not specified, defaults to 1. Cannot be specified along with thedemography
parameter. See the Demographic models section for more details on demographic models and population sizes and the Demography section for usage examples.random_seed (int) – The random seed. If this is not specified or None, a high-quality random seed will be automatically generated. Valid random seeds must be between 1 and \(2^{32} - 1\). See the Random seeds section for usage examples.
num_replicates (int) – The number of replicates of the specified parameters to simulate. If this is not specified or None, no replication is performed and a
tskit.TreeSequence
object returned. If num_replicates is provided, the specified number of replicates is performed, and an iterator over the resultingtskit.TreeSequence
objects returned. See the Running replicate simulations section for examples.record_full_arg (bool) – If True, record all intermediate nodes arising from common ancestor and recombination events in the output tree sequence. This will result in unary nodes (i.e., nodes in marginal trees that have only one child). Defaults to False. See the Ancestral recombination graph section for examples.
additional_nodes (NodeType) – Retain all ancestry for any node of the specified type. This will result in unary nodes. Defaults to class msprime.NodeType(0).
coalescing_segments_only (bool) – If False, retain all ancestry for any nodes that are coalescent nodes anywhere in the sequence. This will result in unary nodes. Should be set to False when recording additional nodes. Defaults to True.
record_migrations (bool) – If True, record all migration events that occur in the Migration Table of the output tree sequence. Defaults to False. See the Migration events section for examples.
initial_state (tskit.TreeSequence) – If specified, initialise the simulation from the root segments of this tree sequence and return the completed tree sequence. Please see Specifying the initial state for details of the required properties of this tree sequence and its interactions with other parameters. All information in the
initial_state
tables is preserved (including metadata) and included in the returned tree sequence. (Default: None).start_time (float) – If specified, set the initial time that the simulation starts to this value. If not specified, the start time is zero if performing a simulation of a set of samples, or is the time of the oldest node if simulating from an existing tree sequence (see the
initial_state
parameter). See the Setting the start time section for examples.end_time (float) – If specified, terminate the simulation at the specified time. In the returned tree sequence, all rootward paths from samples with time <
end_time
will end in a node with one child with time equal to end_time. Any sample nodes with time >=end_time
will also be present in the output tree sequence. If not specified orNone
, run the simulation until all samples have an MRCA at all positions in the genome. See the Stopping simulations early section for examples.record_provenance (bool) – If True (the default), record all input parameters in the tree sequence Provenance.
model (str or msprime.AncestryModel or list) – The ancestry model to use. This can be either a single instance of
AncestryModel
(or a string that can be interpreted as an ancestry model), or a list ofAncestryModel
instances. If theduration
attribute of any of these models is set, the simulation will be run until at most \(t + t_m\), where \(t\) is the simulation time when the model starts and \(t_m\) is the model’sduration
. If theduration
is not set, the simulation will continue until the model completes, the overallend_time
is reached, or overall coalescence. See the Specifying ancestry models section for more details, and the Models section for the available models and examples.
- Returns:
The
tskit.TreeSequence
object representing the results of the simulation if no replication is performed, or an iterator over the independent replicates simulated if the num_replicates parameter has been used.- Return type:
tskit.TreeSequence
or an iterator overtskit.TreeSequence
replicates.
- class msprime.SampleSet(num_samples, population=None, time=None, ploidy=None)[source]#
Specify a set of exchangeable sample individuals with a given ploidy value from a population at a given time. See the Specifying samples section for details and examples.
- num_samples#
The number of k-ploid sample individuals to draw.
- population = None#
The population in which the samples are drawn. May be either a string name or integer ID (see Identifying populations details).
- time = None#
The time at which these samples are drawn. If not specified or None, defaults to the
Population.default_sampling_time
.
- class msprime.NodeType(value)[source]#
Specify the type of node for which you want to track ancestry. Extends the
enum.Flag
class. The NodeType can be specified by means of bitwise operators on the members of the enumeration. See additional nodes for a definition of each of the different node types.- RECOMBINANT = msprime.NodeType(1<<17)#
- COMMON_ANCESTOR = msprime.NodeType(1<<18)#
- MIGRANT = msprime.NodeType(1<<19)#
- CENSUS = msprime.NodeType(1<<20)#
- GENE_CONVERSION = msprime.NodeType(1<<21)#
- PASS_THROUGH = msprime.NodeType(1<<22)#
Models#
- class msprime.AncestryModel(*, duration=None)[source]#
Abstract superclass of all ancestry models.
- duration#
The time duration that this model should run for. If None, the model will run until completion (i.e., until the simulation coalesces or the model itself completes). Otherwise, this defines the maximum time duration which the model can run. See the Specifying ancestry models section for more details.
- class msprime.StandardCoalescent(*, duration=None)[source]#
The classical coalescent with recombination model (i.e., Hudson’s algorithm). The string
"hudson"
can be used to refer to this model.This is a continuous time model in which the time to the next event is exponentially distributed with rates depending on the population size(s), migration rates, numbers of extant lineages and the amount of ancestral material currently present. See Kelleher et al. (2016) for a detailed description of the model and further references.
- class msprime.SmcApproxCoalescent(*, duration=None)[source]#
The Sequentially Markov Coalescent (SMC) model defined by McVean and Cardin (2005). In the SMC, only common ancestor events that result in marginal coalescences are possible. Under this approximation, the marginal trees along the genome depend only on the immediately previous tree (i.e. are Markovian).
Note
This model is implemented using a naive rejection sampling approach and so it may not be any more efficient to simulate than the standard Hudson model.
The string
"smc"
can be used to refer to this model.
- class msprime.SmcPrimeApproxCoalescent(*, duration=None)[source]#
The SMC’ model defined by Marjoram and Wall (2006) as a refinement of the
SMC
. The SMC’ extends the SMC by additionally allowing common ancestor events that join contiguous tracts of ancestral material (as well as events that result in marginal coalescences).Note
This model is implemented using a naive rejection sampling approach and so it may not be any more efficient to simulate than the standard Hudson model.
The string
"smc_prime"
can be used to refer to this model.
- class msprime.DiscreteTimeWrightFisher(*, duration=None)[source]#
A discrete backwards-time Wright-Fisher model, with diploid back-and-forth recombination. The string
"dtwf"
can be used to refer to this model.Wright-Fisher simulations are performed very similarly to coalescent simulations, with all parameters denoting the same quantities in both models. Because events occur at discrete times however, the order in which they occur matters. Each generation consists of the following ordered events:
Migration events. As in the Hudson coalescent, these move single extant lineages between populations. Because migration events occur before lineages choose parents, migrant lineages choose parents from their new population in the same generation.
Demographic events. All events with previous_generation < event_time <= current_generation are carried out here.
Lineages draw parents. Each (monoploid) extant lineage draws a parent from their current population.
Diploid recombination. Each parent is diploid, so all child lineages recombine back-and-forth into the same two parental genome copies. These become two independent lineages in the next generation.
Historical sampling events. All historical samples with previous_generation < sample_time <= current_generation are inserted.
- class msprime.FixedPedigree(*, duration=None)[source]#
Backwards-time simulations through a pre-specified pedigree, with diploid individuals and back-and-forth recombination. The string
"fixed_pedigree"
can be used to refer to this model.The model has no parameters, but requires that the initial_state parameter is provided to
sim_ancestry()
and that it contains a valid pedigree. See the Pedigrees section for more information on the required encoding and methods for translating existing pedigrees into this format.Unlike other ancestry models, the
FixedPedigree
cannot be combined with other simulation models, and special attention must be paid to the role of founders when recapitating these simulations.See the Fixed pedigree section for more information for more information and examples of running simulations within a fixed pedigree.
- class msprime.BetaCoalescent(*, duration=None, alpha=None, truncation_point=1.7976931348623157e+308)[source]#
A Lambda-coalescent with multiple mergers in the haploid cases, or a Xi-coalescent with simultaneous multiple mergers in the polyploid case.
There are two main differences between the Beta-coalescent and the standard coalescent. Firstly, the number of lineages that take part in each common ancestor event is random, with distribution determined by moments of the \(Beta(2 - \alpha, \alpha)\)-distribution. In particular, when there are \(n\) lineages, each set of \(k \leq n\) of them participates in a common ancestor event at rate
\[\frac{1}{B(2 - \alpha, \alpha)} \int_0^1 x^{k - \alpha - 1} (1 - x)^{n - k + \alpha - 1} dx,\]where \(B(2 - \alpha, \alpha)\) is the Beta-function.
If ploidy = 1, then all participating lineages merge into one common ancestor, corresponding to haploid, single-parent reproduction. If ploidy = \(p > 1\), all participating lineages split randomly into \(2 p\) groups, corresponding to two-parent reproduction with \(p\) copies of each chromosome per parent. All lineages within each group merge simultaneously.
Secondly, the number of generations between common ancestor events predicted by the Beta-coalescent is proportional to \(N^{\alpha - 1}\), where \(N\) is the population size. Specifically, the mean number of generations until two lineages undergo a common ancestor event is
\[G = \frac{m^{\alpha} N^{\alpha - 1}}{\alpha B(2 - \alpha, \alpha)},\]if ploidy = 1, and
\[G = \frac{2 p m^{\alpha} (N / 2)^{\alpha - 1}} {\alpha B(2 - \alpha, \alpha)},\]if ploidy = \(p > 1\), where \(m\) is the mean number of juveniles per family given by
\[m = 2 + \frac{2^{\alpha}}{3^{\alpha - 1} (\alpha - 1)},\]if ploidy > 1, and
\[m = 1 + \frac{1}{2^{\alpha - 1} (\alpha - 1)},\]if ploidy = 1.
In the polyploid case we divide the population size \(N\) by two because we assume the \(N\) polyploid individuals form \(N / 2\) two-parent families in which reproduction takes place.
Warning
The number of generations between common ancestor events \(G\) depends both on the population size \(N\) and \(\alpha\), and can be dramatically shorter than in the case of the standard coalescent. For \(\alpha \approx 1\) that is due to insensitivity of \(G\) to \(N\) — see Multiple merger coalescents for an illustration. For \(\alpha \approx 2\), \(G\) is almost linear in \(N\), but can nevertheless be small because \(B(2 - \alpha, \alpha) \rightarrow \infty\) as \(\alpha \rightarrow 2\). As a result, population sizes must often be many orders of magnitude larger than census population sizes to obtain realistic amounts of diversity in simulated samples.
See Schweinsberg (2003) for the derivation of the common ancestor event rate, as well as the number of generations between common ancestor events. Note however that Schweinsberg (2003) only covers the haploid case. For details of the diploid extension, see Blath et al. (2013), and Birkner et al. (2018) for a diploid version of the Schweinsberg (2003) model specifically. The general polyploid model is analogous to the diploid case, with \(2 p\) available copies of parental chromosomes per common ancestor event, and hence up to \(2 p\) simultaneous mergers.
- Parameters:
alpha (float) – Determines the degree of skewness in the family size distribution, and must satisfy \(1 < \alpha < 2\). Smaller values of \(\alpha\) correspond to greater skewness, and \(\alpha = 2\) would coincide with the standard coalescent.
truncation_point (float) – The maximum number of juveniles \(K\) born to one family as a fraction of the population size \(N\). Must satisfy \(0 < K \leq \inf\). Determines the maximum fraction of the population replaced by offspring in one reproduction event, \(\tau\), via \(\tau = K / (K + m)\), where \(m\) is the mean juvenile number above. The default is \(K = \inf\), which corresponds to the standard Beta-coalescent with \(\tau = 1\). When \(K < \inf\), the number of lineages participating in a common ancestor event is determined by moments of the Beta:math:(2 - alpha, alpha) distribution conditioned on not exceeding \(\tau\), and the Beta-function in the expression for \(G\) is replaced by the incomplete Beta-function \(B(\tau; 2 - \alpha, \alpha)\).
- class msprime.DiracCoalescent(*, duration=None, psi=None, c=None)[source]#
A Lambda-coalescent with multiple mergers in the haploid cases, or a Xi-coalescent with simultaneous multiple mergers in the polyploid case.
The Dirac-coalescent is an implementation of the model of Blath et al. (2013) The simulation proceeds similarly to the standard coalescent. In addition to binary common ancestor events at rate \(n (n - 1) / 2\) when there are \(n\) lineages, potential multiple merger events take place at rate \(c > 0\). Each lineage participates in each multiple merger event independently with probability \(0 < \psi \leq 1\).
If ploidy = 1, then all participating lineages merge into one common ancestor, corresponding to haploid, single-parent reproduction. If ploidy = \(p > 1\), all participating lineages split randomly into \(2 p\) groups, corresponding to two-parent reproduction with \(p\) copies of each chromosome per parent. All lineages within each group merge simultaneously.
Warning
The Dirac-coalescent is obtained as a scaling limit of Moran models, rather than Wright-Fisher models. As a consequence, the number of generations between coalescence events is proportional to \(N^2\), rather than \(N\) generations as in the standard coalescent. See Multiple merger coalescents for an illustration of how this affects simulation output in practice.
- Parameters:
c (float) – Determines the rate of potential multiple merger events. We require \(c > 0\).
psi (float) – Determines the fraction of the population replaced by offspring in one large reproduction event, i.e. one reproduction event giving rise to potential multiple mergers when viewed backwards in time. We require \(0 < \psi \leq 1\).
- class msprime.SweepGenicSelection(*, duration=None, position=None, start_frequency=None, end_frequency=None, s=None, dt=None)[source]#
A selective sweep that has occurred in the history of the sample. This will lead to a burst of rapid coalescence near the selected site.
The strength of selection during the sweep is determined by the parameter \(s\). Here we define s such that the fitness of the three genotypes at our beneficial locus are \(W_{bb}=1\), \(W_{Bb}=1 + s/2\), \(W_{BB}=1 + s\). Thus fitness of the heterozygote is intermediate to the two homozygotes.
Warning
Models where start_frequency is \(\geq \frac{1}{ploidy*N}\) will not simulate the trajectory until its origination. In practical terms, this issue means that “sweeps from neutral standing genetic variation” are currently not possible. The tracking issue for this on GitHub is here.
The model is one of a structured coalescent where selective backgrounds are defined as in Braverman et al. (1995) The implementation details here follow closely those in discoal (Kern and Schrider, 2016), with the important difference that \(s\) in msprime is half of of that in discoal (i.e. for equivalent results use \(2s\)).
See Selective sweeps for examples and details on how to specify different types of sweeps.
Warning
Currently models with more than one population and a selective sweep are not implemented. Population size changes during the sweep are not yet possible in msprime.
- Parameters:
position (float) – the location of the beneficial allele along the chromosome.
start_frequency (float) – population frequency of the beneficial allele at the start of the selective sweep. E.g., for a de novo allele in a diploid population of size N, start frequency would be \(1/2N\).
end_frequency (float) – population frequency of the beneficial allele at the end of the selective sweep.
s (float) – \(s\) is the selection coefficient of the beneficial mutation.
dt (float) – dt is the small increment of time for stepping through the sweep phase of the model. a good rule of thumb is for this to be approximately \(1/40N\) or smaller.
Mutations#
- msprime.sim_mutations(tree_sequence, rate=None, *, random_seed=None, model=None, start_time=None, end_time=None, discrete_genome=None, keep=None, record_provenance=True)[source]#
Simulates mutations on the specified ancestry and returns the resulting
tskit.TreeSequence
. Mutations are generated at the specified rate per unit of sequence length, per unit of time. By default, mutations are generated at discrete sites along the genome and multiple mutations can occur at any given site. A continuous sequence, infinite-sites model can also be specified by setting thediscrete_genome
parameter to False.If the
model
parameter is specified, this determines the model of sequence evolution under which mutations are generated. The default mutation model is themsprime.JC69
, a symmetrical mutation model among the ACGT alleles. See the Models section for details of available models.If a
random_seed
is specified, this is used to seed the random number generator. If the same seed is specified and all other parameters are equal then the same mutations will be generated. If no random seed is specified then one is generated automatically.The time interval over which mutations can occur may be controlled using the
start_time
andend_time
parameters. Thestart_time
defines the lower bound (in time-ago) on this interval andmax_time
the upper bound. Note that we may have mutations associated with nodes with time <=start_time
since mutations store the node at the bottom (i.e., towards the leaves) of the branch that they occur on.If the tree sequence already has mutations, these are by default retained, but can be discarded by passing
keep=False
. However, adding new mutations to a tree sequence with existing mutations must be done with caution, since it can lead to incorrect or nonsensical results if mutation probabilities differ by ancestral state. (As an extreme example, suppose that X->Y and X->Z are allowable transitions, but Y->Z is not. If a branch already has an X->Y mutation on it, then calling sim_mutations(…, keep=True) might insert an X->Z mutation above the existing mutation, thus implying the impossible chain X->Y->Z.) However, the effect on nucleotide models of mutation are generally very small.Note
Many mutation models will insert silent transitions (e.g., placing a mutation to A above an existing mutation to A). Such mutations are harmless and are required for us to guarantee the statistical properties of the process of sequentially adding mutations to a tree sequence.
- Parameters:
tree_sequence (tskit.TreeSequence) – The tree sequence we wish to throw mutations onto.
rate (float) – The rate of mutation per unit of sequence length per unit time, as either a single number (for a uniform rate) or as a
RateMap
. (Default: 0).random_seed (int) – The random seed. If this is None, a random seed will be automatically generated. Valid random seeds must be between 1 and \(2^{32} - 1\). See the Random seeds section for usage examples.
model (MutationModel) – The mutation model to use when generating mutations. This can either be a string (e.g.,
"jc69"
) or an instance of a simulation model class e.g,msprime.F84(kappa=0.5)
. If not specified or None, theJC69
mutation model is used. Please see the Models section for more details on specifying mutation models.start_time (float) – The minimum time ago at which a mutation can occur. (Default: no restriction.)
end_time (float) – The maximum time ago at which a mutation can occur (Default: no restriction).
discrete_genome (bool) – Whether to generate mutations at only integer positions along the genome (Default=True).
keep (bool) – Whether to keep existing mutations. (default: True)
record_provenance (bool) – If True, record all input parameters in the tree sequence Provenance.
- Returns:
The
tskit.TreeSequence
object resulting from overlaying mutations on the input tree sequence.- Return type:
Models#
- class msprime.MatrixMutationModel[source]#
Superclass of the specific mutation models with a finite set of states. This class can be used to create a matrix mutation model with custom alleles and probabilities. For details of how this works see Mutation Matrix Models Details.
- Parameters:
alleles (list[str]) – A list of the possible alleles generated by the model. Each entry is a string.
root_distribution (list[float]) – An array of float values the same length as
alleles
which should sum to 1. These values are used to determine the ancestral state of each mutated site. For example ifalleles
is['A','B']
androot_distribution
is[1,0]
all ancestral states will be'A'
.transition_matrix (list[float]) – A square 2d-matrix of transition probabilities where both dimensions are equal to the length of the
alleles
argument. These are used to determine the derived state at each mutation. Note that rows should sum to 1.
- class msprime.MicrosatMutationModel[source]#
General class for Microsatellite mutation models which parameterizes the transition matrix describing changes among alleles using five parameters according to the model of Sainudiin et al. (2004)
See also
See the Microsatellite Mutation Models section for more details and examples.
Concretely, mutation rates under this model are computed as follows. Let
qij
be the transition rate from equation (3) in Sainudiin et al, and letM
be the largest row sum of the matrix whose i,j-th entry isqij
and whose diagonals are zero. Then if the mutation rate passed tosim_mutations()
ism
, the mutation rate from allelei
to allelej
ism * qij / M
.Note that the values of
lo
andhi
are in units of repeat number, and are unrelated to repeat length (e.g., the repeat itself can be dinucleotide, tri-nucleotide, etcetera).Warning
The default values for
lo
andhi
are chosen arbitrarily. The default oflo=1
is chosen so that all states might be represented, but please note that in empirical data, microsatellite loci are often ascertained based on the number of repeats observed. Please ensure that you setlo
andhi
to the appropriate values for your organism / locus of interest.- Parameters:
s (float) – strength of length dependence on mutation rate. Defaults to 0, i.e., no length dependence.
u (float) – Constant bias parameter; must be between 0 and 1. Defaults to 0.5, i.e., no bias.
v (float) – Linear bias parameter. Defaults to 0, i.e., no bias.
p (float) – Probability of a single step mutation. Defaults to 1.
m (float) – 1 / (mean multistep mutation size). Defaults to 1.
lo (int) – Lower bound on repeat number. Defaults to 2.
hi (int) – Upper bound on repeat number (inclusive). Defaults to 50.
root_distribution (list[float]) – An array of float values of length hi-lo+1 which should sum to 1, and give the probabilities used to choose the ancestral state of each microsatellite. Defaults to the stationary distribution of the model.
- class msprime.SMM[source]#
Stepwise mutation model (Ohta and Kimura, 1978), for microsatellite repeat number.
This is a
MicrosatMutationModel
with alleles[lo, .. , hi]
, a root distribution (stationary distribution by default), and a transition matrix that allows only mutations that change the number of repeats by +/- 1. Concretely, if the mutation rate ism
, then an allelek
mutates tok+1
andk-1
at ratem/2
each, except that mutations abovehi
or belowlo
result in no change.- Parameters:
lo (int) – Repeat number lower bound. See the documentation for
MicrosatMutationModel
for details.hi (int) – Repeat number upper bound. See the documentation for
MicrosatMutationModel
for details.root_distribution (list[float]) – See the documentation for
MicrosatMutationModel
for details.
- class msprime.TPM[source]#
Two-phase mutation model of DiRienzo et al. (1994)
This models evolution of microsatellite repeat number in a manner that allows for multi-step mutations in copy number of microsatellite repeats. This is parameterized by p the probability of a single step mutation, and m the success probability of the truncated gamma distribution describing the distribution of longer steps such that the mean multi-step mutation is length 1/m. For this model both p and m need to be set to < 1.0.
This is a
MicrosatMutationModel
with alleles[lo, .. , hi]
. For a precise definition of the mutation rates, seeMicrosatMutationModel
withs=0
,u=0.5
, andv=0
. SeeMicrosatMutationModel
for further details of parameterization.- Parameters:
p (float) – Probability of a single step mutation.
m (float) – Success prob of truncated gamma distribution.
lo (int) – Repeat number lower bound. See the documentation for
MicrosatMutationModel
for details.hi (int) – Repeat number upper bound. See the documentation for
MicrosatMutationModel
for details.root_distribution (list[float]) – See the documentation for
MicrosatMutationModel
for details.
- class msprime.EL2[source]#
Equal rate, linear biased, two-phase mutation model of Garza et al. (1995) as parameterized by Sainudiin et al. (2004)
This models evolution of microsatellite repeat number in a manner that allows for multi-step mutations in copy number. This is parameterized by m the success probability of the truncated gamma distribution describing the distribution of longer steps, such that the mean multi-step mutation is length 1/m. u the constant bias parameter which determines bias (if any) in expansion or contraction of the repeat number, and v the linear bias parameter which determines mutation rate variation associated with repeat number. For this model m need to be set to < 1.0, and u needs to be set to between 0 and 1. and v can take any real, positive or negative value.
See
MicrosatMutationModel
for details on the constant bias parameter u and a linear bias parameter v.This is a
MicrosatMutationModel
with alleles[lo, .. , hi]
For a precise definition of the mutation rates, seeMicrosatMutationModel
withs=0
.- Parameters:
m (float) – Success prob of truncated gamma distribution.
u (float) – Constant bias parameter.
v (float) – Linear bias parameter.
lo (int) – Repeat number lower bound. See the documentation for
MicrosatMutationModel
for details.hi (int) – Repeat number upper bound. See the documentation for
MicrosatMutationModel
for details.root_distribution (list[float]) – See the documentation for
MicrosatMutationModel
for details.
- class msprime.BinaryMutationModel[source]#
Basic binary mutation model with two alleles:
"0"
and"1"
, and ancestral allele always"0"
.This is a
MatrixMutationModel
with alleles["0", "1"]
, root distribution[1.0, 0.0]
, and transition matrix that is either[[0.0, 1.0], [1.0, 0.0]]
(by default), or[[0.5, 0.5], [0.5, 0.5]]
(ifstate_independent=True
).- Parameters:
state_independent (bool) – Whether mutations will be generated in a way independent of the previous allelic state, which includes silent mutations. See Existing mutations for more details.
- class msprime.JC69[source]#
Jukes-Cantor mutation model (Jukes and Cantor 1969). Based on the standard ACGT nucleotides as allelic states, this model assumes equal probabilities for ancestral state and equal probabilities for all possible transitions.
This is a
MatrixMutationModel
with alleles["A", "C", "G", "T"]
, root distribution[1/4, 1/4, 1/4, 1/4]
, and default transition matrix[[0, 1/3, 1/3, 1/3], [1/3, 0, 1/3, 1/3], [1/3, 1/3, 0, 1/3], [1/3, 1/3, 1/3, 0]]
. It has no free parameters.If
state_independent
is True, then instead all entries of the transition matrix will be 1/4, so one-quarter of the mutations will be silent. See Existing mutations for more details.Citation: Jukes TH, Cantor CR (1969). Evolution of Protein Molecules. pp. 21-132.
- Parameters:
state_independent (bool) – Whether mutations will be generated in a way independent of the previous allelic state.
- class msprime.HKY[source]#
The Hasegawa, Kishino and Yano mutation model (Hasegawa et al. 1985). Based on the standard ACGT nucleotides as allelic states, this model allows different rates for transitions and transversions, and sets an equilibrium frequency for each nucleotide. In addition a custom ancestral frequency (
root_distribution
) can be specified. Withkappa=1.0
and the default values of the other arguments this model is equal toJC69
. This model is similar toF84
but with a differing parameterisation forkappa
.This model is parameterised by \(\kappa\) (
kappa
), the ratio of transition to transversion mutation rates, and \(\pi\) (equilibrium_frequencies
), the vector of equilibrium nucleotide frequencies. If this mutation model is used with amutation_rate
of \(\mu\), then the mutation rate from the i-th to the j-th allele is equal to \(Q_{ij}\), where\(\mathbf{Q} = \frac{\mu}{M} \times \begin{bmatrix} \cdot & \pi_C & \kappa \pi_G & \pi_T \\ \pi_A & \cdot & \pi_G & \kappa \pi_T \\ \kappa \pi_A & \pi_C & \cdot & \pi_T \\ \pi_A & \kappa \pi_C & \pi_G & \cdot \end{bmatrix}\)
Here \(M\) is an overall scaling factor on the mutation rate that can be computed as follows: let \(q_{ij} = \pi_j\) if i<->j is a transversion, and \(q_{ij} = \kappa \pi_j\) otherwise. Set \(q_{ii} = 0\), and then let \(M\) be the largest row sum of \(q\), i.e., \(M = \max_i \left( \sum_j q_{ij} \right)\). Then, this implementation as a
MatrixMutationModel
has the transition matrix \(P_{ij} = q_{ij} / M\), and \(P_{ii} = 1 - \sum_{j\neq i} P_{ij}\).Note also that \(\kappa\) is the ratio of individual mutation rates, not the ratio of total transition to transversion rates, which would be \(\kappa/2\), as is used in some parameterisations.
If
state_independent
is true, then the above is modified by setting \(q_{ii} = \kappa \pi_i\). This makes the model completely state-independent only if \(\kappa = 1\) by adding additional silent mutations. See Existing mutations for more details.Citation: Hasegawa M, Kishino H, Yano T (1985). “Dating of the human-ape splitting by a molecular clock of mitochondrial DNA”. Journal of Molecular Evolution. 22 (2): 160–74.
Note that this implementation has the root distribution as a separate parameter, although it defaults to the equilibrium distribution. If you set the root distribution to be different than the equilibrium distribution, then you have a nonequilibrium model, and you should make sure that’s what you want.
- Parameters:
kappa (float) – Scaling parameter by which the transition matrix is modified for transitions (A<>G, C<>T). For example a value of
0
means that no transitions will occur,1
makes the probability of transitions and transversions equal.equilibrium_frequencies (list[float]) – An array of float values of length 4 which should sum to 1. These values are used to determine equilibrium frequencies (long-term averages) of the alleles. Defaults to [0.25, 0.25, 0.25, 0.25], i.e., all nucleotides equally likely.
root_distribution (list[float]) – An array of float values of length 4 which should sum to 1. These values are used to determine the ancestral state of each mutational site. Defaults to the value of
equilibrium_frequencies
.state_independent (bool) – Whether to include a higher rate of silent mutations that makes the model closer to state-independent.
- class msprime.F84[source]#
The F84 mutation model (Felsenstein and Churchill, 1996). Based on the standard ACGT nucleotides as allelic states, this model takes into account transitions and transversions, and sets an equilibrium frequency for each nucleotide. In addition a custom ancestral frequency (
root_distribution
) can be specified. Withkappa=1.0
and the default values of the other arguments this model is equal toJC69
. This model is similar toHKY
but with a differing parameterisation forkappa
.This model is parameterised by \(\kappa\) (
kappa
), the ratio of transition to transversion mutation rates, and \(\pi\) (equilibrium_frequencies
), the vector of equilibrium nucleotide frequencies. If this mutation model is used with amutation_rate
of \(\mu\), then the mutation rate from the i-th to the j-th allele is equal to \(Q_{ij}\), where\(\mathbf{Q} = \frac{\mu}{M} \times \begin{bmatrix} \cdot & \pi_C & \left(1 + \frac{\kappa-1}{\pi_A + \pi_G}\right) \pi_G & \pi_T \\ \pi_A & \cdot & \pi_G & \left(1 + \frac{\kappa-1}{\pi_C + \pi_T}\right) \pi_T \\ \left(1 + \frac{\kappa-1}{\pi_A + \pi_G}\right) \pi_A & \pi_C & \cdot & \pi_T \\ \pi_A & \left(1 + \frac{\kappa-1}{\pi_C + \pi_T}\right) \pi_C & \pi_G & \cdot \end{bmatrix}\)
Here \(M\) is an overall scaling factor on the mutation rate that can be computed as follows: let \(q_{ij} = \pi_j\) if i<->j is a transversion, \(q_{ij} = (1 + (\kappa-1)/(\pi_A + \pi_G)) \pi_j\) if i and j are A and G in some order, \(q_{ij} = (1 + (\kappa-1)/(\pi_C + \pi_T)) \pi_j\) if i and j are C and T in some order, and \(q_{ii} = 0\). Let \(M\) be the largest row sum of \(q\), i.e., \(M = \max_i \left( \sum_j q_{ij} \right)\). Then, this implementation as a
MatrixMutationModel
has the transition matrix \(P_{ij} = q_{ij} / M\), and \(P_{ii} = 1 - \sum_{j\neq i} P_{ij}\).If
state_independent
is true, then the above is modified by setting \(q_{ii} = \kappa \pi_i\). This makes the model completely state-independent only if \(\kappa = 1\) by adding additional silent mutations. See Existing mutations for more details.Citation: Felsenstein J, Churchill GA (January 1996). “A Hidden Markov Model approach to variation among sites in rate of evolution”. Molecular Biology and Evolution. 13 (1): 93–104.
Note that this implementation has the root distribution as a separate parameter, although it defaults to the equilibrium distribution. If you set the root distribution to be different than the equilibrium distribution, then you have a nonequilibrium model, and you should make sure that’s what you want.
- Parameters:
kappa (float) – Scaling parameter by which the transition matrix is modified for transitions (A<>G, C<>T). Must be greater than or equal to the larger of \((\pi_A+\pi_G)/(\pi_C+\pi_T)\) and \((\pi_C+\pi_T)/(\pi_A+\pi_G)\).
equilibrium_frequencies (list[float]) – An array of float values of length 4 which should sum to 1. These values are used to determine equilibrium frequencies (long-term averages) of the alleles. Defaults to [0.25, 0.25, 0.25, 0.25], i.e all nucleotides equally likely.
root_distribution (list[float]) – An array of float values of length 4 which should sum to 1. These values are used to determine the ancestral state of each mutational site. Defaults to the value of
equilibrium_frequencies
.state_independent (bool) – Whether to include a higher rate of silent mutations that makes the model closer to state-independent.
- class msprime.GTR[source]#
The Generalised Time-Reversible nucleotide mutation model, a general parameterisation of a time-reversible mutation process (Tavaré et al. 1986). It allows specification of per-nucleotide equilibrium frequencies and equilibrium transition rates.
This model is parameterised by the vector \(r\) (
relative_rates
), and \(\pi\) (equilibrium_frequencies
), the vector of equilibrium nucleotide frequencies. The entries ofrelative_rates
are, in this order, \((r_{AC}, r_{AG}, r_{AT}, r_{CG}, r_{CT}, r_{GT})\). If this mutation model is used with amutation_rate
of \(\mu\), then the mutation rate from the i-th to the j-th allele is \(Q_{ij}\), where\(\mathbf{Q} = \frac{\mu}{M} \times \begin{bmatrix} \cdot & r_{AC} \pi_C & r_{AG} \pi_G & r_{AT} \pi_T \\ r_{AC} \pi_A & \cdot & r_{CG} \pi_G & r_{CT} \pi_T \\ r_{AG} \pi_A & r_{CG} \pi_C & \cdot & r_{GT} \pi_T \\ r_{AT} \pi_A & r_{CT} \pi_C & r_{GT} \pi_G & \cdot \end{bmatrix}\)
Here \(M\) is an overall scaling factor on the mutation rate that can be computed as follows: let \(q_{ij} = r_{ij} \pi_j\) and \(q_{ii} = 0\), and then let \(M\) be the largest row sum of \(q\), i.e., \(M = \max_i \left( \sum_j q_{ij} \right)\). Then, this implementation as a
MatrixMutationModel
has the transition matrix \(P_{ij} = q_{ij} / M\), and \(P_{ii}\) chosen so rows sum to one.If
state_independent
is true, then the above is modified by setting \(q_{ii} = \pi_i\). This makes the model completely state-independent only if all \(r_{ij} = 1\) by adding additional silent mutations. See Existing mutations for more details.Citation: Tavaré S (1986). “Some Probabilistic and Statistical Problems in the Analysis of DNA Sequences”. Lectures on Mathematics in the Life Sciences. 17: 57–86.
Note that this implementation has the root distribution as a separate parameter, although it defaults to the equilibrium distribution. If you set the root distribution to be different than the equilibrium distribution, then you have a nonequilibrium model, and you should make sure that’s what you want.
- Parameters:
relative_rates (list[float]) – Controls the relative rates of mutation between each pair of nucleotides, in order “A<>C”, “A<>G”, “A<>T”, “C<>G”, “C<>T”, and “G<>T”.
equilibrium_frequencies (list[float]) – The equilibrium base frequencies, a list of four probabilities that sum to one. (Default: equal frequencies.)
root_distribution (list[float]) – An array of float values of length 4 which should sum to 1. These values are used to determine the ancestral state of each mutational site. Defaults to the value of
equilibrium_frequencies
.state_independent (bool) – Whether to include a higher rate of silent mutations that makes the model closer to state-independent.
- class msprime.BLOSUM62[source]#
The BLOSUM62 model of time-reversible amino acid mutation. This model has no free parameters.
The model is parameterised by a 20-by-20 symmetric matrix of relative rates, \(B\), and a vector of amino acid equilibrium frequencies, \(\pi\) (for the precise order, see this model’s .alleles attribute). If this mutation model is used with a
mutation_rate
of \(\mu\), then the mutation rate from the i-th to the j-th allele is \(Q_{ij}\), where\(Q_{ij} = \frac{\mu}{M} B_{ij} \pi_j,\)
where \(M\) is an overall scaling factor chosen so that the largest row sum in the matrix \(B_{ij} \pi_j /M\) is equal to one. In this model, \(M = 1.726203705809619\). This implementation as a
MatrixMutationModel
has transition matrix \(P_{ij} = B_{ij} \pi_{j} / M\), and \(P_{ii}\) chosen so rows sum to one, and root distribution equal to the equilibrium frequencies, \(\pi\), so the matrix \(B\) can be recovered as follows:model = msprime.BLOSUM62() M = 1.726203705809619 B = M * model.transition_matrix / model.root_distribution
Citation: The values of \(B\) and \(\pi\) used here were copied from Seq-Gen. The original citation is: Henikoff, S., and J. G. Henikoff. 1992. PNAS USA 89:10915-10919, and the numerical values were provided in: Yu,Y.-K., Wootton, J.C. and Altschul, S.F. (2003) The compositional adjustment of amino acid substitution matrices. Proc. Natl Acad. Sci., USA, 100, 15688–15693.
- class msprime.PAM[source]#
The PAM model of time-reversible amino acid mutation. This model has no free parameters.
The model is parameterised by a 20-by-20 symmetric matrix of relative rates, \(B\), and a vector of amino acid equilibrium frequencies, \(\pi\) (for the precise order, see this model’s .alleles attribute). If this mutation model is used with a
mutation_rate
of \(\mu\), then the mutation rate from the i-th to the j-th allele is \(Q_{ij}\), where\(Q_{ij} = \frac{\mu}{M} B_{ij} \pi_j,\)
where \(M\) is an overall scaling factor chosen so that the largest row sum in the matrix \(B_{ij} \pi_j /M\) is equal to one. In this model, \(M = 1.78314862248\). This implementation as a
MatrixMutationModel
has transition matrix \(P_{ij} = B_{ij} \pi_{j} / M\), and \(P_{ii}\) chosen so rows sum to one, and root distribution equal to the equilibrium frequencies, \(\pi\), so the matrix \(B\) can be recovered as follows:model = msprime.PAM() M = 1.78314862248 B = M * model.transition_matrix / model.root_distribution
Citation: The values of \(B\) and \(\pi\) used here were copied from Seq-Gen, and follow the “Dayhoff DCMut” model as described in Kosiol, C., and Goldman, N. 2005. Different versions of the Dayhoff rate matrix. Molecular Biology and Evolution 22:193-199. The original citation is: Dayhoff, M.O., Schwartz, R.M., Orcutt, B.C. (1978). A model of evolutionary change in proteins. Atlas of Protein Sequence Structures, Vol. 5, Suppl. 3, National Biomedical Research Foundation, Washington DC, pp. 345-352.
- class msprime.InfiniteAlleles[source]#
An infinite alleles model of mutation in which each allele is a unique positive integer. This works by keeping track of a “next allele”: each time the model is asked to produce a new allele (either for an ancestral or derived state), the “next allele” is provided, and then incremented.
Variables:
next_allele: The next allele to be assigned. This increments by one each time an allele is assigned to a new mutation, and resets to
start_allele
each time a new ancestral state is assigned.- Parameters:
start_allele (int) – The nonnegative integer to start assigning alleles from. (default: 0)
- class msprime.SLiMMutationModel[source]#
An infinite-alleles model of mutation producing “SLiM-style” mutations.
To agree with mutations produced by SLiM, the ancestral state of each new site is set to the empty string, and each derived state is produced by appending the “next allele” to the previous state. The result is that each allele is a comma-separated string of all mutations that have occurred up to the root.
Mutations produced by SLiM carry both a
time
attribute, in units of “time ago” as usual for tskit, as well as a metadata attributed called “origin_generation”, in units of time since the start of the simulation. Adding these two together - time since the start of the simulation plus time until the end - is equal to the total number of generations of the simulation. The origin_generation is not currently used by SLiM, but for consistency, the origin_generation attribute for mutations produced by this model is set equal toslim_generation
minusfloor(mut.time)
, wheremut.time
is the (tskit) time ago of the mutation.- Parameters:
type (int) – The nonnegative integer defining the “type” of SLiM mutation that will be recorded in metadata.
next_id (int) – The nonnegative integer to start assigning alleles from. (default: 0)
slim_generation (int) – The “SLiM generation” time used in determining the “origin_generation” metadata attribute of mutations. This can usually be left at its default, which is 1.
block_size (int) – The block size for allocating derived states. You do not need to change this unless you get an “out of memory” error due to a very large number of stacked mutations.
Node flags#
In the tskit Node Table node flags specify
particular properties about nodes. Msprime follows the standard approach
of setting the tskit.NODE_IS_SAMPLE
flag for all sample nodes,
with all other nodes having a flags value of 0.
Msprime defines some extra flags that help us to identify particular nodes in some situations:
Note
These flags have been deprecated in favour of using
NodeType
. This is not a breaking change. The
constant associated with each flag remains the same. However, working
with flags should be more intuitive now that we are relying on the
enum.Flag
functionality.
- msprime.NODE_IS_RE_EVENT#
The node is an ARG recombination event. Each recombination event is marked with two nodes, one identifying the individual providing the genetic material to the left of the breakpoint and the other providing the genetic material the right. Present either with the
record_full_arg
flag or when adding recombination to theadditional_nodes
to be stored.
- msprime.NODE_IS_CA_EVENT#
The node is an ARG common ancestor event that did not result in marginal coalescence. Present either with the
record_full_arg
flag or when adding common ancestor events to theadditional_nodes
to be stored.
- msprime.NODE_IS_MIG_EVENT#
The node is an ARG migration event identifying the individual that migrated. Can be used in combination with the
record_migrations
option. Present either with therecord_full_arg
flag or when adding migration events to theadditional_nodes
to be stored.
- msprime.NODE_IS_CEN_EVENT#
The node was created by a census event. Please see the Census events section for more details.
- msprime.NODE_IS_GC_EVENT#
The node was created by a gene conversion event.
- msprime.NODE_IS_PASS_THROUGH#
The node identifies an ancestral genome/ploid through which the ancestral material of only a single lineage passed (so no coalescence or common ancestor event). Can be used in combination with the
record_migrations
option. Only present when storing pass through events using theadditional_nodes
option. And only compatible withDTWF
andFixedPedigree
models.
Rate maps#
- class msprime.RateMap(*, position, rate)[source]#
A class mapping a non-negative rate value to a set of non-overlapping intervals along the genome. Intervals for which the rate is unknown (i.e., missing data) are encoded by NaN values in the
rate
array.- Parameters:
- property left#
The left position of each interval (inclusive).
- property right#
The right position of each interval (exclusive).
- property mid#
Returns the midpoint of each interval.
- property span#
Returns the span (i.e.,
right - left
) of each of the intervals.
- property position#
The breakpoint positions between intervals. This is equal to the
left
array with thesequence_length
appended.
- property rate#
The rate associated with each interval. Missing data is encoded by NaN values.
- property mass#
The “mass” of each interval, defined as the
rate
\(\times\)span
. This is NaN for intervals containing missing data.
- property missing#
A boolean array encoding whether each interval contains missing data. Equivalent to
np.isnan(rate_map.rate)
- property non_missing#
A boolean array encoding whether each interval contains non-missing data. Equivalent to
np.logical_not(np.isnan(rate_map.rate))
- property num_intervals#
The total number of intervals in this map. Equal to
num_missing_intervals
+num_non_missing_intervals
.
- property num_missing_intervals#
Returns the number of missing intervals, i.e., those in which the
rate
value is a NaN.
- property num_non_missing_intervals#
The number of non missing intervals, i.e., those in which the
rate
value is not a NaN.
- property sequence_length#
The sequence length covered by this map
- property total_mass#
The cumulative total mass over the entire map.
- property mean_rate#
The mean rate over this map weighted by the span covered by each rate. Unknown intervals are excluded.
- get_rate(x)[source]#
Return the rate at the specified list of positions.
Note
This function will return a NaN value for any positions that contain missing data.
- Parameters:
x (numpy.ndarray) – The positions for which to return values.
- Returns:
An array of rates, the same length as
x
.- Return type:
- get_cumulative_mass(x)[source]#
Return the cumulative mass of the map up to (but not including) a given point for a list of positions along the map. This is equal to the integral of the rate from 0 to the point.
- Parameters:
x (numpy.ndarray) – The positions for which to return values.
- Returns:
An array of cumulative mass values, the same length as
x
- Return type:
- find_index(x)[source]#
Returns the index of the interval that the specified position falls within, such that
rate_map.left[index] <= x < self.rate_map.right[index]
.
- missing_intervals()[source]#
Returns the left and right coordinates of the intervals containing missing data in this map as a 2D numpy array with shape (
num_missing_intervals
, 2). Each row of this returned array is therefore aleft
,right
tuple corresponding to the coordinates of the missing intervals.- Returns:
A numpy array of the coordinates of intervals containing missing data.
- Return type:
- slice(left=None, right=None, *, trim=False)[source]#
Returns a subset of this rate map in the specified interval.
- Parameters:
left (float) – The left coordinate (inclusive) of the region to keep. If
None
, defaults to 0.right (float) – The right coordinate (exclusive) of the region to keep. If
None
, defaults to the sequence length.trim (bool) – If True, remove the flanking regions such that the sequence length of the new rate map is
right
-left
. IfFalse
(default), do not change the coordinate system and mark the flanking regions as “unknown”.
- Returns:
A new RateMap instance
- Return type:
- static read_hapmap(fileobj, sequence_length=None, *, has_header=True, position_col=None, rate_col=None, map_col=None)[source]#
Parses the specified file in HapMap format and returns a
RateMap
. HapMap files must white-space-delimited, and by default are assumed to contain a single header line (which is ignored). Each subsequent line then contains a physical position (in base pairs) and either a genetic map position (in centiMorgans) or a recombination rate (in centiMorgans per megabase). The value in the rate column in a given line gives the constant rate between the physical position in that line (inclusive) and the physical position on the next line (exclusive). By default, the second column of the file is taken as the physical position and the fourth column is taken as the genetic position, as seen in the following sample of the format:Chromosome Position(bp) Rate(cM/Mb) Map(cM) chr10 48232 0.1614 0.002664 chr10 48486 0.1589 0.002705 chr10 50009 0.159 0.002947 chr10 52147 0.1574 0.003287 ... chr10 133762002 3.358 181.129345 chr10 133766368 0.000 181.144008
In the example above, the first row has a nonzero genetic map position (last column, cM), implying a nonzero recombination rate before that position, that is assumed to extend to the start of the chromosome (at position 0 bp). However, if the first line has a nonzero bp position (second column) and a zero genetic map position (last column, cM), then the recombination rate before that position is unknown, producing missing data.
Note
The rows are all assumed to come from the same contig, and the first column is currently ignored. Therefore if you have a single file containing several contigs or chromosomes, you must must split it up into multiple files, and pass each one separately to this function.
- Parameters:
fileobj (str) – Filename or file to read. This is passed directly to
numpy.loadtxt()
, so if the filename extension is .gz or .bz2, the file is decompressed firstsequence_length (float) – The total length of the map. If
None
, then assume it is the last physical position listed in the file. Otherwise it must be greater then or equal to the last physical position in the file, and the region between the last physical position and the sequence_length is padded with a rate of zero.has_header (bool) – If True (default), assume the file has a header row and ignore the first line of the file.
position_col (int) – The zero-based index of the column in the file specifying the physical position in base pairs. If
None
(default) assume an index of 1 (i.e. the second column).rate_col (int) – The zero-based index of the column in the file specifying the rate in cM/Mb. If
None
(default) do not use the rate column, but calculate rates using the genetic map positions, as specified inmap_col
. If the rate column is used, the interval from 0 to first physical position in the file is marked as unknown, and the last value in the rate column must be zero.map_col (int) – The zero-based index of the column in the file specifying the genetic map position in centiMorgans. If
None
(default), assume an index of 3 (i.e. the fourth column). If the first genetic position is 0 the interval from position 0 to the first physical position in the file is marked as unknown. Otherwise, act as if an additional row, specifying physical position 0 and genetic position 0, exists at the start of the file.
- Returns:
A RateMap object.
- Return type:
Demography#
- class msprime.Demography[source]#
The definition of a demographic model for an msprime simulation, consisting of a set of populations, a migration matrix, and a list of demographic events. See the Demographic models section for detailed documentation on how to define, debug and simulate with demography in msprime.
Please see the Demography objects section for details of how to access and update population information within a model.
Demography objects implement the Python
collections.abc.Mapping
protocol, in which the keys are either the populationname
or integerid
values (see the Identifying populations section for more information) and the values arePopulation
objects.In general, population references in methods such as
Demography.add_population_split()
can either be string names or integer IDs, and the two forms can be used interchangeably.- add_population(*, initial_size, growth_rate=None, name=None, description=None, extra_metadata=None, default_sampling_time=None, initially_active=None)[source]#
Adds a new
Population
to thisDemography
with the specified parameters. The new population will have ID equal to the the number of populations immediately beforeadd_population
is called, such that the first population added has ID 0, the next ID 1 and so on. If thename
is not specified, this defaults to"pop_{id}"
. An Initial size value must be specified (but may be zero).- Parameters:
initial_size (float) – The number of individuals of the population at time zero. See the Initial size section for more details and examples.
growth_rate (float) – The exponential growth rate of the population. See the Growth rate section for more details and examples.
name (str) – The human-readable identifier for this population. If not specified, defaults to the string
"pop_{id}"
whereid
is the population’s integer ID. See Identifying populations for more details and recommendations on best practise.description (str) – A concise but informative description of what this population represents within the wider model. Defaults to the empty strings.
extra_metadata (dict) – Extra metadata to associate with this population that will be stored tree sequences output by
sim_ancestry()
. See the Metadata section for more details and examples.default_sampling_time (float) – The time at which samples will be taken from this population, if a time in not otherwise specified. By default this is determined by the details of the model, and whether populations are ancestral in Population split events. See the Default sampling time section for more details.
initially_active (bool) – Whether this population is initially active. By default this is determined by the details of the model, and whether populations are ancestral in Population split events. See the Life cycle section for more details.
- Returns:
The new
Population
instance.- Return type:
- set_migration_rate(source, dest, rate)[source]#
Sets the backwards-time rate of migration from the specified
source
population todest
to the specified value. This has the effect of settingdemography.migration_matrix[source, dest] = rate
. It is the rate at which a lineage currently insource
moves todest
as one follows the lineage back through time.Important
Note this is the backwards in time; migration rate and that
source
anddest
are from the perspective of lineages in the coalescent process. See Migration for more details and clarification on this vital point.The
source
anddest
populations can be referred to either by their integerid
or stringname
values.
- set_symmetric_migration_rate(populations, rate)[source]#
Sets the symmetric migration rate between all pairs of populations in the specified list to the specified value. For a given pair of population IDs
j
andk
, this setsdemography.migration_matrix[j, k] = rate
anddemography.migration_matrix[k, j] = rate
.Populations may be specified either by their integer IDs or by their string names.
- add_population_split(time, *, derived, ancestral)[source]#
Adds a population split event at the specified time. In a population split event all lineages from the (more recent) derived populations move to the (more ancient) ancestral population. Forwards in time, this corresponds to the ancestral population splitting into the derived populations.
See the Population split section for more details and examples.
In addition to moving lineages from the derived population(s) into the ancestral population, a population split has the following additional effects:
All derived populations are set to inactive.
All migration rates to and from the derived populations are set to 0.
Population sizes and growth rates for the derived populations are set to 0.
The
default_sampling_time
of theancestral
Population
is set to the time of this event, if thedefault_sampling_time
for the ancestral population has not already been set.
- add_admixture(time, *, derived, ancestral, proportions)[source]#
Adds an admixture event at the specified time. In an admixture event all lineages from a (more recent)
derived
population move to a list of (more ancient)ancestral
populations according to a list ofproportions
, such that a given lineage has a probabilityproportions[j]
of being moved to the populationancestral[j]
. This movement of lineages backwards in time corresponds to the initial state of the admixed derived population the specifiedtime
being composed of individuals from the specifiedancestral
populations in the specifiedproportions
.See the Admixture section for more details and examples.
In addition to moving lineages from the derived population into the ancestral population(s), an admixture has the following additional effects:
The derived population is set to inactive.
The ancestral populations are set to active, if they are not already active.
All migration rates to and from the derived population are set to 0.
Population sizes and growth rates for the derived population are set to 0, and the population is marked as inactive.
- add_mass_migration(time, *, source, dest, proportion)[source]#
Adds a mass migration (or “pulse migration”) event at the specified time. In a mass migration event, lineages in the
source
population are moved to thedest
population with probabilityproportion
. Forwards-in-time, this corresponds to individuals migrating from populationdest
to populationsource
.Please see the Pulse (mass) migration section for more details and examples.
Warning
Mass migrations are an advanced feature and should only be used if the required population dynamics cannot be modelled by Population split or Admixture events.
Important
Note that
source
anddest
are from the perspective of the coalescent process, i.e. backwards in time; please see the Migration section for more details.- Parameters:
time (float) – The time at which this event occurs in generations.
source (str, int) – The population from which lineages are moved.
dest (str, int) – The population to which lineages are moved.
proportion (float) – For each lineage in the
source
population, this is the probability that it moves to thedest
population.
- add_migration_rate_change(time, *, rate, source=None, dest=None)[source]#
Changes the rate of migration from one deme to another to a new value at a specific time. Migration rates are specified in terms of the rate at which lineages move from population
source
todest
during the progress of the simulation.Important
Note that
source
anddest
are from the perspective of the coalescent process, i.e. backwards in time; please see the Migration section for more details.By default,
source=None
anddest=None
, which results in all non-diagonal elements of the migration matrix being changed to the new rate. Ifsource
anddest
are specified, they must refer to valid populations (either integer IDs or string names).
- add_symmetric_migration_rate_change(time, populations, rate)[source]#
Sets the symmetric migration rate between all pairs of populations in the specified list to the specified value. For a given pair of population IDs
j
andk
, this setsmigration_matrix[j, k] = rate
andmigration_matrix[k, j] = rate
.Please see the Migration section for more details.
Populations may be specified either by their integer IDs or by their string names.
- add_population_parameters_change(time, *, initial_size=None, growth_rate=None, population=None)[source]#
Changes the size parameters of a population (or all populations) at a given time.
Please see the Populations section for more details.
- Parameters:
time (float) – The length of time ago at which this event occurred.
initial_size (float) – The number of individuals in the population at the beginning of the time slice starting at
time
. If None, the initial_size of the population is computed according to the initial population size and growth rate over the preceding time slice.growth_rate (float) – The new per-generation growth rate. If None, the growth rate is not changed. Defaults to None.
population (str, int) – The ID of the population affected. If
population
is None, the changes affect all populations simultaneously.
- add_simple_bottleneck(time, population, proportion=None)[source]#
Adds a population bottleneck at the specified time in which each lineage has probability equal to
proportion
of coalescing into a single ancestor.Please see the Simple bottleneck section for more details.
- add_instantaneous_bottleneck(time, *, population, strength)[source]#
Adds a bottleneck at the specified time in the specified population that is equivalent to the coalescent process running for
strength
generations.Please see the Instantaneous bottleneck section for more details.
Note
The ploidy is also use to scale the time scale of the coalescent process during the bottleneck.
- add_census(time)[source]#
Adds a “census” event at the specified time. In a census we add a node to each branch of every tree, thus recording the population that each lineage is in at the specified time.
This may be used to record all ancestral haplotypes present at that time, and to extract other information related to these haplotypes: for instance to trace the local ancestry of a sample back to a set of contemporaneous ancestors, or to assess whether a subset of samples has coalesced more recently than the census time. (However, these added nodes will only represent the portions of ancestral genomes inherited by the samples, rather than complete ancestral genomes.)
See Census events for more details.
Warning
When used in the conjunction with the DTWF model non-integer census times should be used to guarantee that the census nodes don’t coincide with coalescences (and therefore zero branch length errors). See Using the DTWF model for more details.
- Parameters:
time (float) – The time at which the census should occur.
- validate()[source]#
Checks the demography looks sensible and raises errors/warnings appropriately, and return a copy in which all default values have been appropriately resolved.
- copy(populations=None)[source]#
Returns a copy of this model. If the
populations
argument is specified, the populations in the copied model will be in this order.- Parameters:
populations (list) – A list of population identifiers defining the order of the populations in the new model. If not specified, the current order is used.
- Returns:
A copy of this Demography.
- debug()[source]#
Returns a
DemographyDebugger
instance for this demography.- Returns:
A DemographyDebugger object for this demography.
- Return type:
- assert_equal(other)[source]#
Compares this Demography with specified
other
and raises an AssertionError if they are not exactly equal.- Parameters:
other (Demography) – The other demography to compare against.
- is_equivalent(other, rel_tol=None, abs_tol=None)[source]#
Compares this demography with the other and return True if they are equivalent up to the specified numerical tolerances. Two demographies are equivalent if, they have the same set of epochs defined by demographic events, and for each epoch:
The population’s
initial_size
,growth_rate
andactive
values are equal in all populations.The migration matrices are equal
The same sequence of lineage movements through population splits, etc.
All numerical comparisons are performed using
math.isclose()
.- Parameters:
other (Demography) – The other demography to compare against.
rel_tol (float) – The relative tolerance used by math.isclose.
abs_tol (float) – The relative tolerance used by math.isclose.
- Returns:
True if this demography and other are equivalent up to numerical tolerances.
- Rtype bool:
bool
- static from_species_tree(tree, initial_size, *, time_units='gen', generation_time=None, growth_rate=None)[source]#
Parse a species tree in Newick format and return the corresponding
Demography
object. The tree is assumed to be rooted and ultrametric and branch lengths must be included and correspond to time, either in units of millions of years, years, or generations.The returned
Demography
object contains aPopulation
for each node in the species tree. The population’sname
attribute will be either the corresponding node label from the newick tree, if it exists, or otherwise the name takes the form “pop_{j}”, where j is the position of the given population in the list. Leaf populations are first in the list, and added in left-to-right order. Populations corresponding to the internal nodes are then added in a postorder traversal of the species tree. For each internal node a Population split event is added so that lineages move from its child populations at the appropriate time and rates of continuous migration to and from the child populations is set to zero. See the Population split section for more details.The initial sizes and growth rates for the populations in the model are set via the
initial_size
andgrowth_rate
arguments. These can be specified in two ways: if a single number is provided, this is used for all populations. The argument may also be a mapping from population names to their respective values. For example:tree = "(A:10.0,B:10.0)C" initial_size = {"A": 1000, "B": 2000, "C": 100} demography = msprime.Demography.from_species_tree(tree, initial_size)
Note that it is possible to have default population sizes for unnamed ancestral populations using a
collections.defaultdict
, e.g.,tree = "(A:10.0,B:10.0)" initial_size = collections.defaultdict(lambda: 100) initial_size.update({"A": 1000, "B": 2000}) demography = msprime.Demography.from_species_tree(tree, initial_size)
- Parameters:
tree (str) – The tree string in Newick format, with named leaves and branch lengths.
initial_size – Each population’s initial_size, in numbers of individuals. May be a single number or a mapping from population names to their sizes.
growth_rate – Each population’s growth_rate. May be a single number or a mapping from population names to their exponential growth rates. Defaults to zero.
time_units (str) – The units of time in which the species tree’s branch lengths are measured. Allowed branch length units are millions of years, years, and generations; these should be specified with the strings
"myr"
,"yr"
, or"gen"
, respectively. This defaults to"gen"
.generation_time (float) – The number of years per generation. If and only if the branch lengths are not in units of generations, the generation time must be specified. This defaults to None.
- Returns:
A Demography object representing the specified species tree.
- Return type:
- static from_starbeast(tree, generation_time, time_units='myr')[source]#
Parse a species tree produced by the program TreeAnnotator based on a posterior tree distribution generated with StarBEAST and return the corresponding Demography object.
Species trees produced by TreeAnnotator are written in Nexus format and are rooted, bifurcating, and ultrametric. Branch lengths usually are in units of millions of years, but the use of other units is permitted by StarBEAST (and thus TreeAnnotator). This function allows branch length units of millions of years or years. Leaves must be named and the tree must include information on population sizes of leaf and ancestral species in the form of annotation with the “dmv” tag, which is the case for trees written by TreeAnnotator based on StarBEAST posterior tree distributions.
The returned
Demography
object contains aPopulation
for each node in the species tree. The population’sname
attribute will be either the corresponding node label from the newick tree, if it exists, or otherwise the name takes the form “pop_{j}”, where j is the position of the given population in the list. Leaf populations are first in the list, and added in left-to-right order. Populations corresponding to the internal nodes are then added in a postorder traversal of the species tree. For each internal node a Population split event is added so that lineages move from its child populations at the appropriate time and rates of continuous migration to and from the child populations is set to zero. See the Population split section for more details.- Parameters:
tree (str) – The tree string in Nexus format, with named leaves, branch lengths, and branch annotation. Typically, this string is the entire content of a file written by TreeAnnotator.
generation_time (float) – The number of years per generation.
time_units (str) – The units of time in which the species tree’s branch lengths are measured. Allowed branch length units are millions of years, and years; these should be specified with the strings
"myr"
or"yr"
, respectively. This defaults to"myr"
.
- Returns:
A
Demography
instance that describing the information in the specified species tree.- Return type:
- static from_old_style(population_configurations=None, *, migration_matrix=None, demographic_events=None, Ne=1, ignore_sample_size=False, population_map=None)[source]#
Creates a Demography object from the pre 1.0 style input parameters, reproducing the old semantics with respect to default values.
No sample information is stored in the new-style
Demography
objects, and therefore if thesample_size
attribute of any of the inputPopulationConfiguration
objects is set a ValueError will be raised by default. However, if theignore_sample_size
parameter is set to True, this check will not be performed and the sample sizes specified in the old-stylePopulationConfiguration
objects will be ignored.Each
PopulationConfiguration
instance in the list ofpopulation_configurations
corresponds to the equivalentPopulation
object in the returnedDemography
. If a PopulationConfiguration hasmetadata
defined and this dictionary contains aname
field, this will be used as thePopulation
name. Otherwise, the default population names will be used.Please see the Specifying samples section for details on how to specify sample locations in
sim_ancestry()
.Todo
Document the remaining parameters.
- static from_demes(graph)[source]#
Creates a
Demography
object from the specified demes graph. Time values in the demes graph may be specified in any units, but the returned object has units converted into generations. See the Demes section for details.import demes graph = demes.load("model.yaml") demography = msprime.Demography.from_demes(graph)
- Parameters:
graph (demes.Graph) – A demes graph.
- Returns:
A
Demography
instance corresponding to the demes model.- Return type:
- static from_tree_sequence(ts, initial_size=0)[source]#
Creates a
Demography
object based on the information in the specifiedtskit.TreeSequence
. The returned demography will contain a population for each of the populations in the tree sequence, in the same order.The metadata for each population in the tree sequence will be inspected. If a schema is present and the metadata can be decoded, the
name
anddescription
properties of populations are set if the corresponding keys are present.If the metadata cannot be decoded, the default values for
name
anddescription
are used.The
initial_size
of each of the new populations is set to zero by default, and all otherPopulation
attributes are set to their default values. It is therefore essential to update theinitial_size
andgrowth_rate
values to reflect the desired demography.See also
See the initial state section for examples of how this method can be used.
- Parameters:
ts (tskit.TreeSequence) – The tree sequence to extract population information from.
initial_size (float) – The default initial size for the newly added populations, as a number of individuals (Default=0).
- Returns:
A Demography object representing the populations in the specified tree sequence.
- Return type:
- static isolated_model(initial_size, *, growth_rate=None)[source]#
Returns a
Demography
object representing a collection of isolated populations with specified initial population sizes and growth rates. Please see Demographic models for more details on population sizes and growth rates.- Parameters:
initial_size (list) – the
initial_size
value for each of thePopulation
in the returned model. The length of the array corresponds to the number of populations. model.growth_rate (list) – The exponential growth rate for each population. Must be either None (the default, resulting a zero growth rate) or an array with the same length as
initial_size
.
- Returns:
A Demography object representing this model, suitable as input to
sim_ancestry()
.- Return type:
- static island_model(initial_size, migration_rate, *, growth_rate=None)[source]#
Returns a
Demography
object representing a collection of populations with specified initial population sizes and growth rates, with symmetric migration between each pair of populations at the specified rate. Please see Demographic models for more details on population sizes and growth rates.- Parameters:
initial_size (list) – the
initial_size
value for each of thePopulation
in the returned model. The length of the array corresponds to the number of populations. model.migration_rate (float) – The migration rate between each pair of populations.
growth_rate (list) – The exponential growth rate for each population. Must be either None (the default, resulting a zero growth rate) or an array with the same length as
initial_size
.
- Returns:
A Demography object representing this model, suitable as input to
sim_ancestry()
.- Return type:
- static stepping_stone_model(initial_size, migration_rate, *, growth_rate=None, boundaries=False)[source]#
Returns a
Demography
object representing a collection of populations with specified initial population sizes and growth rates, in which adjacent demes exchange migrants at the specified rate. Please see Demographic models for more details on population sizes and growth rates.Note
The current implementation only supports a one-dimensional stepping stone model, but higher dimensions could also be supported. Please open an issue on GitHub if this feature would be useful to you.
- Parameters:
initial_size (list) – the
initial_size
value for each of thePopulation
in the returned model. The length of the array corresponds to the number of populations.migration_rate (float) – The migration rate between adjacent pairs of populations.
growth_rate (list) – The exponential growth rate for each population. Must be either None (the default, resulting a zero growth rate) or an array with the same length as
initial_size
.boundaries (bool) – If True the stepping stone model has boundary conditions imposed so that demes at either end of the chain do not exchange migrants. If False (the default), the set of populations is “circular” and migration takes place between the terminal demes.
- Returns:
A Demography object representing this model, suitable as input to
sim_ancestry()
.- Return type:
- to_demes()[source]#
Creates a
demes.Graph
object from the demography. See the Demes section for details.Note
Demographic models using bottlenecks added via the
add_simple_bottleneck()
oradd_instantaneous_bottleneck()
methods are not able to be converted into a demes graph.Note
Demes is stricter than msprime with regard to how a demographic model is structured, so models that can be simulated with msprime are not guaranteed to be convertible to a demes graph. In particular, msprime’s legacy API permits setting migrations or other attributes for a population even after that population has been merged into an ancestor. Such models are rarely constructed deliberately, so an error during conversion of a legacy model could indicate model misspecification.
The returned graph can be saved as a Demes-format YAML file using the demes API.
import demes demography = msprime.Demography.island_model([1000] * 3, 1e-5) graph = demography.to_demes() demes.dump(graph, "island_model.yaml")
Or plotted using the demesdraw visualisation package.
import demesdraw demography = msprime.Demography.island_model([1000] * 3, 1e-5) graph = demography.to_demes() ax = demesdraw.tubes(graph) ax.figure.savefig("island_model.pdf")
- Returns:
A
demes.Graph
object corresponding to the demography.- Return type:
- class msprime.Population(initial_size=0.0, growth_rate=0.0, name=None, description='', extra_metadata=<factory>, default_sampling_time=None, initially_active=None, id=None)[source]#
A single population in a
Demography
. See the Populations section for more information on what populations represent, and how they can be used.Warning
This class should not be instantiated directly. Please use
Demography.add_population()
method instead.- initial_size = 0.0#
The absolute size of the population at time zero. See the Initial size section for more details and examples.
- growth_rate = 0.0#
The exponential growth rate of the population per generation (forwards in time). Growth rates can be negative. This is zero for a constant population size, and positive for a population that has been growing. See the Growth rate section for more details and examples.
- name = None#
The name of the population. If specified this must be a uniquely identifying string and must be a valid Python identifier (i.e., could be used as a variable name in Python code). See Identifying populations for more details and recommendations on best practise.
- description = ''#
A concise description of the population. Defaults to the empty string if not specified.
- extra_metadata#
A JSON-encodable dictionary of metadata items to be stored in the associated tskit population object. This dictionary must not contain keys for any of the pre-defined metadata items. See the Metadata section for more details and examples.
- default_sampling_time = None#
The default time at which samples are drawn from this population. See the Default sampling time section for more details.
- initially_active = None#
If True, this population will always be initially active, regardless of whether it participates in a Population split. If not set, or None, the initial state of the population will be set automatically depending on the events declared in the demography. See the Life cycle section for more details.
- id = None#
The integer ID of this population within the parent
Demography
. This attribute is assigned by the Demography class and should not be set or changed by user code.
- class msprime.DemographyDebugger(Ne=1, population_configurations=None, migration_matrix=None, demographic_events=None, model=None, *, demography=None)[source]#
Utilities to compute and display information about the state of populations during the different simulation epochs defined by demographic events.
Warning
This class is not intended to be instantiated directly using the constructor - please use
Demography.debug()
to obtain a DemographyDebugger for a givenDemography
instead.- print_history(output=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>)[source]#
Prints a summary of the history of the populations.
Deprecated since 1.0: use
print(debugger)
instead.
- property population_size_history#
Returns a (num_pops, num_epochs) numpy array giving the starting population size for each population in each epoch.
- property epoch_start_time#
The array of epoch start_times defined by the demographic model.
- property num_epochs#
Returns the number of epochs defined by the demographic model.
- population_size_trajectory(steps)[source]#
Return an array of per-population population sizes, as defined by the demographic model. These are the initial_size parameters of the model, modified by any population growth rates. The sizes are computed at the time points given by steps.
- Parameters:
steps (list) – List of times ago at which the population size will be computed.
- Returns:
Returns a numpy array of population sizes, with one column per population, whose [i,j]th entry is the size of population j at time steps[i] ago.
- lineage_probabilities(steps, sample_time=0)[source]#
Returns an array such that P[j, a, b] is the probability that a lineage that started in population a at time sample_time is in population b at time steps[j] ago.
This function reports sampling probabilities _before_ mass migration events (or other events that move lineages) at a step time, if a mass migration event occurs at one of those times. Migrations will then effect the next time step.
- Parameters:
steps (list) – A list of times to compute probabilities.
sample_time – The time of sampling of the lineage. For any times in steps that are more recent than sample_time, the probability of finding the lineage in any population is zero.
- Returns:
An array of dimension len(steps) by num pops by num_pops.
- possible_lineage_locations(samples=None)[source]#
Given the sampling configuration, this function determines when lineages are possibly found within each population over epochs defined by demographic events and sampling times. If no sampling configuration is given, we assume we sample lineages from every population at time zero.
The epoch intervals returned are those in which there are distinct configurations of possible lineage locations, and so the number of returned epochs may be less than the total number of epochs defined by the demography and will depend on the input sample configuration.
The samples are specified by either a list of population identifiers ( integer IDs or string names) or by a list of
SampleSet
objects, allowing sampling times to be specified explicitly. If thetime
field of theSampleSet
is not specified (or population IDs are used) samples are taken at the population’s default_sampling_time. OnlySampleSet
objects withnum_samples > 0
are counted as contributing samples to a particular population.To support legacy code,
Sample
objects from the 0.x API can also provided, although its use is discouraged in new code.- Parameters:
samples (list) – The populations that we sample from. Can be either a list of population identifiers,
SampleSet
orSample
objects.- Returns:
Returns a dictionary with epoch intervals as keys whose values are a list with length equal to the number of populations with True and False indicating which populations could possibly contain lineages over that epoch. The epoch intervals are given by tuples: (epoch start, epoch end). The first epoch necessarily starts at time 0, and the final epoch has end time of infinity.
- mean_coalescence_time(lineages, min_pop_size=1, steps=None, rtol=0.005, max_iter=12)[source]#
Compute the mean time until coalescence between pairs of the specified sample
lineages
. Sample lineages are specified as a mapping from populations to the number of monoploid sample genomes present in that population at time zero. See the Coalescence rates and mean times section for usage examples and more details.Important
This function assumes a diploid model when computing coalescence rates (see the Coalescent time scales section for more information).
The calculation is performed by using
coalescence_rate_trajectory()
to compute the probability that the lineages have not yet coalesced by time t, and using these to approximate \(E[T] = \int_t^\infty P(T > t) dt\), where \(T\) is the coalescence time. Seecoalescence_rate_trajectory()
for more details.To compute this, an adequate time discretisation must be arrived at by iteratively extending or refining the current discretisation. Debugging information about numerical convergence of this procedure is logged using the Python
logging
infrastructure. The daiquiri module is a convenient way to set up logging, and we can use it to make these messages appear on stderr like this:import daiquiri daiquiri.setup(level="DEBUG") debugger.mean_coalescence_time(1)
Briefly, this outputs iteration number, mean coalescence time, maximum difference in probability of not having coalesced yet, difference to last coalescence time, probability of not having coalesced by the final time point, and whether the last iteration was an extension or refinement.
- Parameters:
lineages (dict) – A mapping of populations (either integer IDs or string names: see the Identifying populations section for more details) to the number of monoploid sample lineages in that population.
min_pop_size (int) – See
coalescence_rate_trajectory()
.steps (list) – The time discretisation to start out with (by default, picks something based on epoch times).
rtol (float) – The relative tolerance to determine mean coalescence time to (used to decide when to stop subdividing the steps).
max_iter (int) – The maximum number of times to subdivide the steps.
- Returns:
The mean coalescence time (a number).
- Return type:
- coalescence_rate_trajectory(steps, lineages, min_pop_size=1, double_step_validation=True)[source]#
Calculate the mean coalescence rates and proportions of uncoalesced lineages between pairs of the specified sample lineages, at each of the times ago listed by steps, in this demographic model. Sample lineages are specified as a mapping from populations to the number of monoploid sample genomes present in that population at time zero. See the Inverse instantaneous coalescence rates section for usage examples and more details.
The coalescence rate at time t in the past is the average rate of coalescence of as-yet-uncoalesced lineages, computed as follows: let \(p(t)\) be the probability that the lineages of a randomly chosen pair of samples has not yet coalesced by time \(t\), let \(p(z,t)\) be the probability that the lineages of a randomly chosen pair of samples has not yet coalesced by time \(t\) and are both in population \(z\), and let \(N(z,t)\) be the diploid effective population size of population \(z\) at time \(t\). Then the mean coalescence rate at time \(t\) is \(r(t) = (\sum_z p(z,t) / (2 * N(z,t))) / p(t)\).
The computation is done by approximating population size trajectories with piecewise constant trajectories between each of the steps. For this to be accurate, the distance between the steps must be small enough so that (a) short epochs (e.g., bottlenecks) are not missed, and (b) populations do not change in size too much over that time, if they are growing or shrinking. This function optionally provides a simple check of this approximation by recomputing the coalescence rates on a grid of steps twice as fine and throwing a warning if the resulting values do not match to a relative tolerance of 0.001.
- Parameters:
steps (list) – The times ago at which coalescence rates will be computed.
lineages (dict) – A mapping of populations (either integer IDs or string names: see the Identifying populations section for more details) to the number of monoploid sample lineages in that population.
min_pop_size (int) – The smallest allowed population size during computation of coalescent rates (i.e., coalescence rates are actually 1 / (2 * max(min_pop_size, N(z,t))). Spurious very small population sizes can occur in models where populations grow exponentially but are unused before some time in the past, and lead to floating point error. This should be set to a value smaller than the smallest desired population size in the model.
double_step_validation (bool) – Whether to perform the check that step sizes are sufficiently small, as described above. This is highly recommended, and will take at most four times the computation.
- Returns:
A tuple of arrays whose jth elements, respectively, are the coalescence rate at the jth time point (denoted r(t[j]) above), and the probability that a randomly chosen pair of lineages has not yet coalesced (denoted p(t[j]) above).
- Return type:
Likelihoods#
- msprime.log_arg_likelihood(ts, recombination_rate, Ne=1)[source]#
Returns the log probability of the stored tree sequence under the Hudson ARG. An exact expression for this probability is given in equation (1) of Kuhner et al. (2000).
We assume branch lengths stored in generations, resulting in a coalescence rate of \(1 / (2 N_e)\) per pair of lineages.
Warning
The stored tree sequence must store the full realisation of the ARG, including all recombination events and all common ancestor events, regardless of whether the recombinations cause a change in the ancestral tree or whether the common ancestor events cause coalescence of ancestral material. See Ancestral recombination graph for details of this data structure, and how to generate them using
msprime
.Warning
This method only supports continuous genomes. See Discrete or continuous? for how these can be specified when simulating tree sequences using
msprime
.- Parameters:
ts (tskit.TreeSequence) – The tree sequence object.
recombination_rate (float) – The per-link, per-generation recombination probability. Must be non-negative.
Ne (float) – The diploid effective population size.
- Returns:
The log probability of the tree sequence under the Hudson ancestral recombination graph model. If the recombination rate is zero and the tree sequence contains at least one recombination event, then returns -DBL_MAX.
- msprime.log_mutation_likelihood(ts, mutation_rate)[source]#
Returns the unnormalised log probability of the stored pattern of mutations on the stored tree sequence, assuming infinite sites mutation. In particular, each stored site must only contain a single mutation, and all mutant alleles are treated interchangeably as having arisen from a single mutation rate. This is automatically true for some msprime mutation models, such as JC69. The omitted normalising constant depends on the pattern of mutations, but not on the tree sequence or the mutation rate.
Warning
The infinite sites assumption means that this method is only valid for continuous genomes. It can also be run on discrete genomes, for which it may provide a reasonable approximation when the number of sites is large and the mutation rate is low. See Discrete or continuous? for how continuous genomes can be specified when simulating tree sequences using msprime.
The function first computes the probability of the overall number of mutations \(M\) from the Poisson probability mass function
\[e^{-T \mu / 2} \frac{(T \mu / 2)^M}{M!},\]where \(T\) is the total area of ancestral material in the tree sequence stored in units of generations, and \(\mu\) is the per-site, per-generation mutation probability. Each mutation then contributes an individual factor of \(l / T\), where \(l\) is the total branch length on which the mutation could have arisen while appearing on all of the required lineages, again stored in generations.
Warning
If a tree at the site of a mutation contains unary nodes, then \(l\) could span more than one edge. In particular, we do not constrain mutations to take place on the edge directly above the node on which they have been recorded, but rather on any edge which would yield the same configuration of SNPs at the leaves of the tree sequence.
- Parameters:
ts (tskit.TreeSequence) – The tree sequence object with mutations.
mutation_rate (float) – The per-site, per-generation mutation probability. Must be non-negative.
- Returns:
The unnormalised log probability of the observed SNPs given the tree sequence. If the mutation rate is set to zero and the tree sequence contains at least one mutation, then returns -float(“inf”).
Pedigrees#
- class msprime.PedigreeBuilder(demography=None, individuals_metadata_schema=None)[source]#
Utility for building pedigrees in the format required for input to the
FixedPedigree
ancestry model.See also
See the Pedigrees section for more information on how pedigrees are described and imported in msprime.
Example:
pb = msprime.PedigreeBuilder() mom_id = pb.add_individual(time=1) dad_id = pb.add_individual(time=1) pb.add_individual(time=0, parents=[mom_id, dad_id], is_sample=True) pedigree_tables = pb.finalise()
- Parameters:
demography – The
Demography
defining populations referred to in thepopulations
column, if specified. If None (the default) a demography consisting of one population is used (and only population 0 can be referred to).individuals_metadata_schema (tskit.MetadataSchema) – If specified, set the
metadata_schema
for the individuals table in the final table collection. Must be an instance oftskit.MetadataSchema
See the Metadata section for more information.
- add_individual(*, time, is_sample=None, parents=None, population=None, metadata=None)[source]#
Adds an individual with the specified properties, returning its ID.
- Parameters:
time (float) – The time for this individual measured in generations ago.
is_sample (bool) – If True, the new individual is marked as a sample; if False, the individual is not marked as a sample. If None (the default) the individual is marked as a sample if its
time
is zero. Parent IDs are not checked, and may refer to individuals not yet added to the tables.parents (list(int)) – The integer IDs of the specified individual’s parents. Exactly two parents must be specified. If None (the default), the individual is treated as a founder by setting its parents to
[-1, -1]
.population (str|int) – The population to associated with this individual. The value can be a string or integer, following the usual population identifier rules. If None (the default), a population ID of 0 will be assigned if the
Demography
associated with this PedigreeBuilder has a single population (the default). If the demography has more than one population, then a population must be explicitly specified for each individual.metadata (dict|bytes) – Any metadata to associate with the new individual. See the Metadata section for more information.
- Returns:
The ID of the newly added individual.
- Return type:
- finalise(sequence_length=None)[source]#
Returns the
tskit.TableCollection
describing the pedigree defined by calls toadd_individual()
.The state of the pedigree builder is not modified by this method.
- Parameters:
sequence_length (float) – If specified, set the
sequence_length
property of the returned TableCollection to this value. IfNone
(the default) thesequence_length
is-1
.- Returns:
The TableCollection defining the pedigree.
- Return type:
- msprime.parse_pedigree(text_file, *, demography=None, sequence_length=None)[source]#
Parse a text file describing a pedigree used for input to the
FixedPedigree
ancestry model. See the Pedigree encoding section for more information on the data encoding used for pedigrees.See also
See the Format definition section for a detailed description of the columns and formatting requirements for this file format.
The returned
tskit.TableCollection
will contain an individual and two nodes for each data row in the input file. The individual will have a metadata fieldfile_id
containing the value of theid
column in the input. Individuals (and their corresponding nodes) are added to the tables in the order seen in the file. There is no ordering requirement for parents and children.If a
Demography
instance is provided to thedemography
parameter, this is used to translate and validate population identifiers in thepopulation
column, and is also used to fill the population table in the outputtskit.TableCollection
. See the Pedigrees and demography section for more information on the interaction between demography andFixedPedigree
simulations.- Parameters:
text_file – A file-like object to read from.
demography – The
Demography
defining populations referred to in thepopulations
column, if specified. If None (the default) a demography consisting of one population is used (and only population 0 can be referred to).sequence_length (float) – If specified, set the
sequence_length
property of the returned TableCollection to this value.
- Returns:
The
tskit.TableCollection
object containing the corresponding pedigree data.- Return type: