API Reference#

This lists the detailed reference documentation for the msprime Python API. The reference documentation aims to be concise, precise and exhaustive; as such, it is not the best place to start if you are new to a particular piece of functionality. Please see the top-level documentation for Ancestry simulations, Mutation simulations or Demographic models for discussion and examples of individual features.

Summary#

Ancestry#

sim_ancestry([samples, demography, ...])

Simulates an ancestral process described by the specified model, demography and samples, and return a tskit.TreeSequence (or a sequence of replicate tree sequences).

SampleSet(num_samples[, population, time, ...])

Specify a set of exchangeable sample individuals with a given ploidy value from a population at a given time.

StandardCoalescent(*[, duration])

The classical coalescent with recombination model (i.e., Hudson's algorithm).

SmcApproxCoalescent(*[, duration])

The Sequentially Markov Coalescent (SMC) model defined by McVean and Cardin (2005).

SmcPrimeApproxCoalescent(*[, duration])

The SMC' model defined by Marjoram and Wall (2006) as a refinement of the SMC.

DiscreteTimeWrightFisher(*[, duration])

A discrete backwards-time Wright-Fisher model, with diploid back-and-forth recombination.

FixedPedigree(*[, duration])

Backwards-time simulations through a pre-specified pedigree, with diploid individuals and back-and-forth recombination.

BetaCoalescent(*[, duration, alpha, ...])

A Lambda-coalescent with multiple mergers in the haploid cases, or a Xi-coalescent with simultaneous multiple mergers in the polyploid case.

DiracCoalescent(*[, duration, psi, c])

A Lambda-coalescent with multiple mergers in the haploid cases, or a Xi-coalescent with simultaneous multiple mergers in the polyploid case.

SweepGenicSelection(*[, duration, position, ...])

A selective sweep that has occurred in the history of the sample.

Mutations#

sim_mutations(tree_sequence[, rate, ...])

Simulates mutations on the specified ancestry and returns the resulting tskit.TreeSequence.

JC69([state_independent])

Jukes-Cantor mutation model (Jukes and Cantor 1969).

HKY(kappa[, equilibrium_frequencies, ...])

The Hasegawa, Kishino and Yano mutation model (Hasegawa et al. 1985).

F84(kappa[, equilibrium_frequencies, ...])

The F84 mutation model (Felsenstein and Churchill, 1996).

GTR(relative_rates[, ...])

The Generalised Time-Reversible nucleotide mutation model, a general parameterisation of a time-reversible mutation process (Tavaré et al. 1986).

BLOSUM62()

The BLOSUM62 model of time-reversible amino acid mutation.

PAM()

The PAM model of time-reversible amino acid mutation.

BinaryMutationModel([state_independent])

Basic binary mutation model with two alleles: "0" and "1", and ancestral allele always "0".

MatrixMutationModel

Superclass of the specific mutation models with a finite set of states.

MicrosatMutationModel(*[, s, u, v, p, m, ...])

General class for Microsatellite mutation models which parameterizes the transition matrix describing changes among alleles using five parameters according to the model of Sainudiin et al. (2004).

SMM([lo, hi, root_distribution])

Stepwise mutation model (Ohta and Kimura, 1978), for microsatellite repeat number.

TPM(*, p, m[, lo, hi, root_distribution])

Two-phase mutation model of DiRienzo et al. (1994).

EL2(*, m, u, v[, lo, hi, root_distribution])

Equal rate, linear biased, two-phase mutation model of Garza et al. (1995) as parameterized by Sainudiin et al. (2004).

InfiniteAlleles

An infinite alleles model of mutation in which each allele is a unique positive integer.

SLiMMutationModel

An infinite-alleles model of mutation producing "SLiM-style" mutations.

Demography#

Population([initial_size, growth_rate, ...])

A single population in a Demography.

Demography([populations, events, ...])

The definition of a demographic model for an msprime simulation, consisting of a set of populations, a migration matrix, and a list of demographic events.

DemographyDebugger([Ne, ...])

Utilities to compute and display information about the state of populations during the different simulation epochs defined by demographic events.

Utilities#

RateMap(*, position, rate)

A class mapping a non-negative rate value to a set of non-overlapping intervals along the genome.

PedigreeBuilder([demography, ...])

Utility for building pedigrees in the format required for input to the FixedPedigree ancestry model.

parse_pedigree(text_file, *[, demography, ...])

Parse a text file describing a pedigree used for input to the FixedPedigree ancestry model.

Reference documentation#

Ancestry#

msprime.sim_ancestry(samples=None, *, demography=None, sequence_length=None, discrete_genome=None, recombination_rate=None, gene_conversion_rate=None, gene_conversion_tract_length=None, population_size=None, ploidy=None, model=None, initial_state=None, start_time=None, end_time=None, record_migrations=None, record_full_arg=None, additional_nodes=None, coalescing_segments_only=None, num_labels=None, random_seed=None, num_replicates=None, replicate_index=None, record_provenance=None)[source]#

Simulates an ancestral process described by the specified model, demography and samples, and return a tskit.TreeSequence (or a sequence of replicate tree sequences).

Parameters:
  • samples – The sampled individuals as either an integer, specifying the number of individuals to sample in a single-population model; or a list of SampleSet objects defining the properties of groups of similar samples; or as a mapping in which the keys are population identifiers (either an integer ID or string name) and the values are the number of samples to take from the corresponding population at its default sampling time. It is important to note that samples correspond to individuals here, and each sampled individual is usually associated with \(k\) sample nodes (or genomes) when ploidy = \(k\). See the Specifying samples section for further details. Either samples or initial_state must be specified.

  • demography – The demographic model to simulate, describing the extant and ancestral populations, their population sizes and growth rates, their migration rates, and demographic events affecting the populations over time. See the Demographic models section for details on how to specify demographic models and Specifying samples for details on how to specify the populations that samples are drawn from. If not specified (or None) we default to a single population with constant size 1 (see also the population_size parameter).

  • ploidy (int) – The number of monoploid genomes per sample individual (Default=2). See the Ploidy section for usage examples.

  • sequence_length (float) – The length of the genome sequence to simulate. See the Sequence length section for usage examples for this parameter and how it interacts with other parameters.

  • discrete_genome (bool) – If True (the default) simulation occurs in discrete genome coordinates such that recombination and gene conversion breakpoints always occur at integer positions. Thus, multiple (e.g.) recombinations can occur at the same genome position. If discrete_genome is False simulations are performed using continuous genome coordinates. In this case multiple events at precisely the same genome location are very unlikely (but technically possible). See the Discrete or continuous? section for usage examples.

  • recombination_rate – The rate of recombination along the sequence; can be either a single value (specifying a single rate over the entire sequence) or an instance of RateMap. See the Recombination section for usage examples for this parameter and how it interacts with other parameters.

  • gene_conversion_rate – The rate of gene conversion along the sequence. If provided, a value for gene_conversion_tract_length must also be specified. See the Gene conversion section for usage examples for this parameter and how it interacts with other parameters.

  • gene_conversion_tract_length – The mean length of the gene conversion tracts. For discrete genomes the tract lengths are geometrically distributed with mean gene_conversion_tract_length, which must be greater than or equal to 1. For continuous genomes the tract lengths are exponentially distributed with mean gene_conversion_tract_length, which must be larger than 0.

  • population_size – The number of individuals of the default single population Demography. If not specified, defaults to 1. Cannot be specified along with the demography parameter. See the Demographic models section for more details on demographic models and population sizes and the Demography section for usage examples.

  • random_seed (int) – The random seed. If this is not specified or None, a high-quality random seed will be automatically generated. Valid random seeds must be between 1 and \(2^{32} - 1\). See the Random seeds section for usage examples.

  • num_replicates (int) – The number of replicates of the specified parameters to simulate. If this is not specified or None, no replication is performed and a tskit.TreeSequence object returned. If num_replicates is provided, the specified number of replicates is performed, and an iterator over the resulting tskit.TreeSequence objects returned. See the Running replicate simulations section for examples.

  • record_full_arg (bool) – If True, record all intermediate nodes arising from common ancestor and recombination events in the output tree sequence. This will result in unary nodes (i.e., nodes in marginal trees that have only one child). Defaults to False. See the Ancestral recombination graph section for examples.

  • additional_nodes (NodeType) – Retain all ancestry for any node of the specified type. This will result in unary nodes. Defaults to class msprime.NodeType(0).

  • coalescing_segments_only (bool) – If False, retain all ancestry for any nodes that are coalescent nodes anywhere in the sequence. This will result in unary nodes. Should be set to False when recording additional nodes. Defaults to True.

  • record_migrations (bool) – If True, record all migration events that occur in the Migration Table of the output tree sequence. Defaults to False. See the Migration events section for examples.

  • initial_state (tskit.TreeSequence) – If specified, initialise the simulation from the root segments of this tree sequence and return the completed tree sequence. Please see Specifying the initial state for details of the required properties of this tree sequence and its interactions with other parameters. All information in the initial_state tables is preserved (including metadata) and included in the returned tree sequence. (Default: None).

  • start_time (float) – If specified, set the initial time that the simulation starts to this value. If not specified, the start time is zero if performing a simulation of a set of samples, or is the time of the oldest node if simulating from an existing tree sequence (see the initial_state parameter). See the Setting the start time section for examples.

  • end_time (float) – If specified, terminate the simulation at the specified time. In the returned tree sequence, all rootward paths from samples with time < end_time will end in a node with one child with time equal to end_time. Any sample nodes with time >= end_time will also be present in the output tree sequence. If not specified or None, run the simulation until all samples have an MRCA at all positions in the genome. See the Stopping simulations early section for examples.

  • record_provenance (bool) – If True (the default), record all input parameters in the tree sequence Provenance.

  • model (str or msprime.AncestryModel or list) – The ancestry model to use. This can be either a single instance of AncestryModel (or a string that can be interpreted as an ancestry model), or a list of AncestryModel instances. If the duration attribute of any of these models is set, the simulation will be run until at most \(t + t_m\), where \(t\) is the simulation time when the model starts and \(t_m\) is the model’s duration. If the duration is not set, the simulation will continue until the model completes, the overall end_time is reached, or overall coalescence. See the Specifying ancestry models section for more details, and the Models section for the available models and examples.

Returns:

The tskit.TreeSequence object representing the results of the simulation if no replication is performed, or an iterator over the independent replicates simulated if the num_replicates parameter has been used.

Return type:

tskit.TreeSequence or an iterator over tskit.TreeSequence replicates.

class msprime.SampleSet(num_samples, population=None, time=None, ploidy=None)[source]#

Specify a set of exchangeable sample individuals with a given ploidy value from a population at a given time. See the Specifying samples section for details and examples.

num_samples#

The number of k-ploid sample individuals to draw.

population = None#

The population in which the samples are drawn. May be either a string name or integer ID (see Identifying populations details).

time = None#

The time at which these samples are drawn. If not specified or None, defaults to the Population.default_sampling_time.

ploidy = None#

The number of monoploid genomes to sample for each sample individual. See the Ploidy section for more details and examples.

class msprime.NodeType(value)[source]#

Specify the type of node for which you want to track ancestry. Extends the enum.Flag class. The NodeType can be specified by means of bitwise operators on the members of the enumeration. See additional nodes for a definition of each of the different node types.

RECOMBINANT = msprime.NodeType(1<<17)#
COMMON_ANCESTOR = msprime.NodeType(1<<18)#
MIGRANT = msprime.NodeType(1<<19)#
CENSUS = msprime.NodeType(1<<20)#
GENE_CONVERSION = msprime.NodeType(1<<21)#
PASS_THROUGH = msprime.NodeType(1<<22)#

Models#

class msprime.AncestryModel(*, duration=None)[source]#

Abstract superclass of all ancestry models.

duration#

The time duration that this model should run for. If None, the model will run until completion (i.e., until the simulation coalesces or the model itself completes). Otherwise, this defines the maximum time duration which the model can run. See the Specifying ancestry models section for more details.

class msprime.StandardCoalescent(*, duration=None)[source]#

The classical coalescent with recombination model (i.e., Hudson’s algorithm). The string "hudson" can be used to refer to this model.

This is a continuous time model in which the time to the next event is exponentially distributed with rates depending on the population size(s), migration rates, numbers of extant lineages and the amount of ancestral material currently present. See Kelleher et al. (2016) for a detailed description of the model and further references.

class msprime.SmcApproxCoalescent(*, duration=None)[source]#

The Sequentially Markov Coalescent (SMC) model defined by McVean and Cardin (2005). In the SMC, only common ancestor events that result in marginal coalescences are possible. Under this approximation, the marginal trees along the genome depend only on the immediately previous tree (i.e. are Markovian).

Note

This model is implemented using a naive rejection sampling approach and so it may not be any more efficient to simulate than the standard Hudson model.

The string "smc" can be used to refer to this model.

class msprime.SmcPrimeApproxCoalescent(*, duration=None)[source]#

The SMC’ model defined by Marjoram and Wall (2006) as a refinement of the SMC. The SMC’ extends the SMC by additionally allowing common ancestor events that join contiguous tracts of ancestral material (as well as events that result in marginal coalescences).

Note

This model is implemented using a naive rejection sampling approach and so it may not be any more efficient to simulate than the standard Hudson model.

The string "smc_prime" can be used to refer to this model.

class msprime.DiscreteTimeWrightFisher(*, duration=None)[source]#

A discrete backwards-time Wright-Fisher model, with diploid back-and-forth recombination. The string "dtwf" can be used to refer to this model.

Wright-Fisher simulations are performed very similarly to coalescent simulations, with all parameters denoting the same quantities in both models. Because events occur at discrete times however, the order in which they occur matters. Each generation consists of the following ordered events:

  • Migration events. As in the Hudson coalescent, these move single extant lineages between populations. Because migration events occur before lineages choose parents, migrant lineages choose parents from their new population in the same generation.

  • Demographic events. All events with previous_generation < event_time <= current_generation are carried out here.

  • Lineages draw parents. Each (monoploid) extant lineage draws a parent from their current population.

  • Diploid recombination. Each parent is diploid, so all child lineages recombine back-and-forth into the same two parental genome copies. These become two independent lineages in the next generation.

  • Historical sampling events. All historical samples with previous_generation < sample_time <= current_generation are inserted.

class msprime.FixedPedigree(*, duration=None)[source]#

Backwards-time simulations through a pre-specified pedigree, with diploid individuals and back-and-forth recombination. The string "fixed_pedigree" can be used to refer to this model.

The model has no parameters, but requires that the initial_state parameter is provided to sim_ancestry() and that it contains a valid pedigree. See the Pedigrees section for more information on the required encoding and methods for translating existing pedigrees into this format.

Unlike other ancestry models, the FixedPedigree cannot be combined with other simulation models, and special attention must be paid to the role of founders when recapitating these simulations.

See the Fixed pedigree section for more information for more information and examples of running simulations within a fixed pedigree.

class msprime.BetaCoalescent(*, duration=None, alpha=None, truncation_point=1.7976931348623157e+308)[source]#

A Lambda-coalescent with multiple mergers in the haploid cases, or a Xi-coalescent with simultaneous multiple mergers in the polyploid case.

There are two main differences between the Beta-coalescent and the standard coalescent. Firstly, the number of lineages that take part in each common ancestor event is random, with distribution determined by moments of the \(Beta(2 - \alpha, \alpha)\)-distribution. In particular, when there are \(n\) lineages, each set of \(k \leq n\) of them participates in a common ancestor event at rate

\[\frac{1}{B(2 - \alpha, \alpha)} \int_0^1 x^{k - \alpha - 1} (1 - x)^{n - k + \alpha - 1} dx,\]

where \(B(2 - \alpha, \alpha)\) is the Beta-function.

If ploidy = 1, then all participating lineages merge into one common ancestor, corresponding to haploid, single-parent reproduction. If ploidy = \(p > 1\), all participating lineages split randomly into \(2 p\) groups, corresponding to two-parent reproduction with \(p\) copies of each chromosome per parent. All lineages within each group merge simultaneously.

Secondly, the number of generations between common ancestor events predicted by the Beta-coalescent is proportional to \(N^{\alpha - 1}\), where \(N\) is the population size. Specifically, the mean number of generations until two lineages undergo a common ancestor event is

\[G = \frac{m^{\alpha} N^{\alpha - 1}}{\alpha B(2 - \alpha, \alpha)},\]

if ploidy = 1, and

\[G = \frac{2 p m^{\alpha} (N / 2)^{\alpha - 1}} {\alpha B(2 - \alpha, \alpha)},\]

if ploidy = \(p > 1\), where \(m\) is the mean number of juveniles per family given by

\[m = 2 + \frac{2^{\alpha}}{3^{\alpha - 1} (\alpha - 1)},\]

if ploidy > 1, and

\[m = 1 + \frac{1}{2^{\alpha - 1} (\alpha - 1)},\]

if ploidy = 1.

In the polyploid case we divide the population size \(N\) by two because we assume the \(N\) polyploid individuals form \(N / 2\) two-parent families in which reproduction takes place.

Warning

The number of generations between common ancestor events \(G\) depends both on the population size \(N\) and \(\alpha\), and can be dramatically shorter than in the case of the standard coalescent. For \(\alpha \approx 1\) that is due to insensitivity of \(G\) to \(N\) — see Multiple merger coalescents for an illustration. For \(\alpha \approx 2\), \(G\) is almost linear in \(N\), but can nevertheless be small because \(B(2 - \alpha, \alpha) \rightarrow \infty\) as \(\alpha \rightarrow 2\). As a result, population sizes must often be many orders of magnitude larger than census population sizes to obtain realistic amounts of diversity in simulated samples.

See Schweinsberg (2003) for the derivation of the common ancestor event rate, as well as the number of generations between common ancestor events. Note however that Schweinsberg (2003) only covers the haploid case. For details of the diploid extension, see Blath et al. (2013), and Birkner et al. (2018) for a diploid version of the Schweinsberg (2003) model specifically. The general polyploid model is analogous to the diploid case, with \(2 p\) available copies of parental chromosomes per common ancestor event, and hence up to \(2 p\) simultaneous mergers.

Parameters:
  • alpha (float) – Determines the degree of skewness in the family size distribution, and must satisfy \(1 < \alpha < 2\). Smaller values of \(\alpha\) correspond to greater skewness, and \(\alpha = 2\) would coincide with the standard coalescent.

  • truncation_point (float) – The maximum number of juveniles \(K\) born to one family as a fraction of the population size \(N\). Must satisfy \(0 < K \leq \inf\). Determines the maximum fraction of the population replaced by offspring in one reproduction event, \(\tau\), via \(\tau = K / (K + m)\), where \(m\) is the mean juvenile number above. The default is \(K = \inf\), which corresponds to the standard Beta-coalescent with \(\tau = 1\). When \(K < \inf\), the number of lineages participating in a common ancestor event is determined by moments of the Beta:math:(2 - alpha, alpha) distribution conditioned on not exceeding \(\tau\), and the Beta-function in the expression for \(G\) is replaced by the incomplete Beta-function \(B(\tau; 2 - \alpha, \alpha)\).

class msprime.DiracCoalescent(*, duration=None, psi=None, c=None)[source]#

A Lambda-coalescent with multiple mergers in the haploid cases, or a Xi-coalescent with simultaneous multiple mergers in the polyploid case.

The Dirac-coalescent is an implementation of the model of Blath et al. (2013) The simulation proceeds similarly to the standard coalescent. In addition to binary common ancestor events at rate \(n (n - 1) / 2\) when there are \(n\) lineages, potential multiple merger events take place at rate \(c > 0\). Each lineage participates in each multiple merger event independently with probability \(0 < \psi \leq 1\).

If ploidy = 1, then all participating lineages merge into one common ancestor, corresponding to haploid, single-parent reproduction. If ploidy = \(p > 1\), all participating lineages split randomly into \(2 p\) groups, corresponding to two-parent reproduction with \(p\) copies of each chromosome per parent. All lineages within each group merge simultaneously.

Warning

The Dirac-coalescent is obtained as a scaling limit of Moran models, rather than Wright-Fisher models. As a consequence, the number of generations between coalescence events is proportional to \(N^2\), rather than \(N\) generations as in the standard coalescent. See Multiple merger coalescents for an illustration of how this affects simulation output in practice.

Parameters:
  • c (float) – Determines the rate of potential multiple merger events. We require \(c > 0\).

  • psi (float) – Determines the fraction of the population replaced by offspring in one large reproduction event, i.e. one reproduction event giving rise to potential multiple mergers when viewed backwards in time. We require \(0 < \psi \leq 1\).

class msprime.SweepGenicSelection(*, duration=None, position=None, start_frequency=None, end_frequency=None, s=None, dt=None)[source]#

A selective sweep that has occurred in the history of the sample. This will lead to a burst of rapid coalescence near the selected site.

The strength of selection during the sweep is determined by the parameter \(s\). Here we define s such that the fitness of the three genotypes at our beneficial locus are \(W_{bb}=1\), \(W_{Bb}=1 + s/2\), \(W_{BB}=1 + s\). Thus fitness of the heterozygote is intermediate to the two homozygotes.

Warning

Models where start_frequency is \(\geq \frac{1}{ploidy*N}\) will not simulate the trajectory until its origination. In practical terms, this issue means that “sweeps from neutral standing genetic variation” are currently not possible. The tracking issue for this on GitHub is here.

The model is one of a structured coalescent where selective backgrounds are defined as in Braverman et al. (1995) The implementation details here follow closely those in discoal (Kern and Schrider, 2016), with the important difference that \(s\) in msprime is half of of that in discoal (i.e. for equivalent results use \(2s\)).

See Selective sweeps for examples and details on how to specify different types of sweeps.

Warning

Currently models with more than one population and a selective sweep are not implemented. Population size changes during the sweep are not yet possible in msprime.

Parameters:
  • position (float) – the location of the beneficial allele along the chromosome.

  • start_frequency (float) – population frequency of the beneficial allele at the start of the selective sweep. E.g., for a de novo allele in a diploid population of size N, start frequency would be \(1/2N\).

  • end_frequency (float) – population frequency of the beneficial allele at the end of the selective sweep.

  • s (float) – \(s\) is the selection coefficient of the beneficial mutation.

  • dt (float) – dt is the small increment of time for stepping through the sweep phase of the model. a good rule of thumb is for this to be approximately \(1/40N\) or smaller.

Mutations#

msprime.sim_mutations(tree_sequence, rate=None, *, random_seed=None, model=None, start_time=None, end_time=None, discrete_genome=None, keep=None)[source]#

Simulates mutations on the specified ancestry and returns the resulting tskit.TreeSequence. Mutations are generated at the specified rate per unit of sequence length, per unit of time. By default, mutations are generated at discrete sites along the genome and multiple mutations can occur at any given site. A continuous sequence, infinite-sites model can also be specified by setting the discrete_genome parameter to False.

If the model parameter is specified, this determines the model of sequence evolution under which mutations are generated. The default mutation model is the msprime.JC69, a symmetrical mutation model among the ACGT alleles. See the Models section for details of available models.

If a random_seed is specified, this is used to seed the random number generator. If the same seed is specified and all other parameters are equal then the same mutations will be generated. If no random seed is specified then one is generated automatically.

The time interval over which mutations can occur may be controlled using the start_time and end_time parameters. The start_time defines the lower bound (in time-ago) on this interval and max_time the upper bound. Note that we may have mutations associated with nodes with time <= start_time since mutations store the node at the bottom (i.e., towards the leaves) of the branch that they occur on.

If the tree sequence already has mutations, these are by default retained, but can be discarded by passing keep=False. However, adding new mutations to a tree sequence with existing mutations must be done with caution, since it can lead to incorrect or nonsensical results if mutation probabilities differ by ancestral state. (As an extreme example, suppose that X->Y and X->Z are allowable transitions, but Y->Z is not. If a branch already has an X->Y mutation on it, then calling sim_mutations(…, keep=True) might insert an X->Z mutation above the existing mutation, thus implying the impossible chain X->Y->Z.) However, the effect on nucleotide models of mutation are generally very small.

Note

Many mutation models will insert silent transitions (e.g., placing a mutation to A above an existing mutation to A). Such mutations are harmless and are required for us to guarantee the statistical properties of the process of sequentially adding mutations to a tree sequence.

Parameters:
  • tree_sequence (tskit.TreeSequence) – The tree sequence we wish to throw mutations onto.

  • rate (float) – The rate of mutation per unit of sequence length per unit time, as either a single number (for a uniform rate) or as a RateMap. (Default: 0).

  • random_seed (int) – The random seed. If this is None, a random seed will be automatically generated. Valid random seeds must be between 1 and \(2^{32} - 1\). See the Random seeds section for usage examples.

  • model (MutationModel) – The mutation model to use when generating mutations. This can either be a string (e.g., "jc69") or an instance of a simulation model class e.g, msprime.F84(kappa=0.5). If not specified or None, the JC69 mutation model is used. Please see the Models section for more details on specifying mutation models.

  • start_time (float) – The minimum time ago at which a mutation can occur. (Default: no restriction.)

  • end_time (float) – The maximum time ago at which a mutation can occur (Default: no restriction).

  • discrete_genome (bool) – Whether to generate mutations at only integer positions along the genome (Default=True).

  • keep (bool) – Whether to keep existing mutations. (default: True)

Returns:

The tskit.TreeSequence object resulting from overlaying mutations on the input tree sequence.

Return type:

tskit.TreeSequence

Models#

class msprime.MutationModel[source]#

Abstract superclass of msprime mutation models.

class msprime.MatrixMutationModel[source]#

Superclass of the specific mutation models with a finite set of states. This class can be used to create a matrix mutation model with custom alleles and probabilities. For details of how this works see Mutation Matrix Models Details.

Parameters:
  • alleles (list[str]) – A list of the possible alleles generated by the model. Each entry is a string.

  • root_distribution (list[float]) – An array of float values the same length as alleles which should sum to 1. These values are used to determine the ancestral state of each mutated site. For example if alleles is ['A','B'] and root_distribution is [1,0] all ancestral states will be 'A'.

  • transition_matrix (list[float]) – A square 2d-matrix of transition probabilities where both dimensions are equal to the length of the alleles argument. These are used to determine the derived state at each mutation. Note that rows should sum to 1.

class msprime.MicrosatMutationModel[source]#

General class for Microsatellite mutation models which parameterizes the transition matrix describing changes among alleles using five parameters according to the model of Sainudiin et al. (2004)

See also

See the Microsatellite Mutation Models section for more details and examples.

Concretely, mutation rates under this model are computed as follows. Let qij be the transition rate from equation (3) in Sainudiin et al, and let M be the largest row sum of the matrix whose i,j-th entry is qij and whose diagonals are zero. Then if the mutation rate passed to sim_mutations() is m, the mutation rate from allele i to allele j is m * qij / M.

Note that the values of lo and hi are in units of repeat number, and are unrelated to repeat length (e.g., the repeat itself can be dinucleotide, tri-nucleotide, etcetera).

Warning

The default values for lo and hi are chosen arbitrarily. The default of lo=1 is chosen so that all states might be represented, but please note that in empirical data, microsatellite loci are often ascertained based on the number of repeats observed. Please ensure that you set lo and hi to the appropriate values for your organism / locus of interest.

Parameters:
  • s (float) – strength of length dependence on mutation rate. Defaults to 0, i.e., no length dependence.

  • u (float) – Constant bias parameter; must be between 0 and 1. Defaults to 0.5, i.e., no bias.

  • v (float) – Linear bias parameter. Defaults to 0, i.e., no bias.

  • p (float) – Probability of a single step mutation. Defaults to 1.

  • m (float) – 1 / (mean multistep mutation size). Defaults to 1.

  • lo (int) – Lower bound on repeat number. Defaults to 2.

  • hi (int) – Upper bound on repeat number (inclusive). Defaults to 50.

  • root_distribution (list[float]) – An array of float values of length hi-lo+1 which should sum to 1, and give the probabilities used to choose the ancestral state of each microsatellite. Defaults to the stationary distribution of the model.

class msprime.SMM[source]#

Stepwise mutation model (Ohta and Kimura, 1978), for microsatellite repeat number.

This is a MicrosatMutationModel with alleles [lo, .. , hi], a root distribution (stationary distribution by default), and a transition matrix that allows only mutations that change the number of repeats by +/- 1. Concretely, if the mutation rate is m, then an allele k mutates to k+1 and k-1 at rate m/2 each, except that mutations above hi or below lo result in no change.

Parameters:
class msprime.TPM[source]#

Two-phase mutation model of DiRienzo et al. (1994)

This models evolution of microsatellite repeat number in a manner that allows for multi-step mutations in copy number of microsatellite repeats. This is parameterized by p the probability of a single step mutation, and m the success probability of the truncated gamma distribution describing the distribution of longer steps such that the mean multi-step mutation is length 1/m. For this model both p and m need to be set to < 1.0.

This is a MicrosatMutationModel with alleles [lo, .. , hi]. For a precise definition of the mutation rates, see MicrosatMutationModel with s=0, u=0.5, and v=0. See MicrosatMutationModel for further details of parameterization.

Parameters:
class msprime.EL2[source]#

Equal rate, linear biased, two-phase mutation model of Garza et al. (1995) as parameterized by Sainudiin et al. (2004)

This models evolution of microsatellite repeat number in a manner that allows for multi-step mutations in copy number. This is parameterized by m the success probability of the truncated gamma distribution describing the distribution of longer steps, such that the mean multi-step mutation is length 1/m. u the constant bias parameter which determines bias (if any) in expansion or contraction of the repeat number, and v the linear bias parameter which determines mutation rate variation associated with repeat number. For this model m need to be set to < 1.0, and u needs to be set to between 0 and 1. and v can take any real, positive or negative value.

See MicrosatMutationModel for details on the constant bias parameter u and a linear bias parameter v.

This is a MicrosatMutationModel with alleles [lo, .. , hi] For a precise definition of the mutation rates, see MicrosatMutationModel with s=0.

Parameters:
class msprime.BinaryMutationModel[source]#

Basic binary mutation model with two alleles: "0" and "1", and ancestral allele always "0".

This is a MatrixMutationModel with alleles ["0", "1"], root distribution [1.0, 0.0], and transition matrix that is either [[0.0, 1.0], [1.0, 0.0]] (by default), or [[0.5, 0.5], [0.5, 0.5]] (if state_independent=True).

Parameters:

state_independent (bool) – Whether mutations will be generated in a way independent of the previous allelic state, which includes silent mutations. See Existing mutations for more details.

class msprime.JC69[source]#

Jukes-Cantor mutation model (Jukes and Cantor 1969). Based on the standard ACGT nucleotides as allelic states, this model assumes equal probabilities for ancestral state and equal probabilities for all possible transitions.

This is a MatrixMutationModel with alleles ["A", "C", "G", "T"], root distribution [1/4, 1/4, 1/4, 1/4], and default transition matrix [[0, 1/3, 1/3, 1/3], [1/3, 0, 1/3, 1/3], [1/3, 1/3, 0, 1/3], [1/3, 1/3, 1/3, 0]]. It has no free parameters.

If state_independent is True, then instead all entries of the transition matrix will be 1/4, so one-quarter of the mutations will be silent. See Existing mutations for more details.

Citation: Jukes TH, Cantor CR (1969). Evolution of Protein Molecules. pp. 21-132.

Parameters:

state_independent (bool) – Whether mutations will be generated in a way independent of the previous allelic state.

class msprime.HKY[source]#

The Hasegawa, Kishino and Yano mutation model (Hasegawa et al. 1985). Based on the standard ACGT nucleotides as allelic states, this model allows different rates for transitions and transversions, and sets an equilibrium frequency for each nucleotide. In addition a custom ancestral frequency (root_distribution) can be specified. With kappa=1.0 and the default values of the other arguments this model is equal to JC69. This model is similar to F84 but with a differing parameterisation for kappa.

This model is parameterised by \(\kappa\) (kappa), the ratio of transition to transversion mutation rates, and \(\pi\) (equilibrium_frequencies), the vector of equilibrium nucleotide frequencies. If this mutation model is used with a mutation_rate of \(\mu\), then the mutation rate from the i-th to the j-th allele is equal to \(Q_{ij}\), where

\(\mathbf{Q} = \frac{\mu}{M} \times \begin{bmatrix} \cdot & \pi_C & \kappa \pi_G & \pi_T \\ \pi_A & \cdot & \pi_G & \kappa \pi_T \\ \kappa \pi_A & \pi_C & \cdot & \pi_T \\ \pi_A & \kappa \pi_C & \pi_G & \cdot \end{bmatrix}\)

Here \(M\) is an overall scaling factor on the mutation rate that can be computed as follows: let \(q_{ij} = \pi_j\) if i<->j is a transversion, and \(q_{ij} = \kappa \pi_j\) otherwise. Set \(q_{ii} = 0\), and then let \(M\) be the largest row sum of \(q\), i.e., \(M = \max_i \left( \sum_j q_{ij} \right)\). Then, this implementation as a MatrixMutationModel has the transition matrix \(P_{ij} = q_{ij} / M\), and \(P_{ii} = 1 - \sum_{j\neq i} P_{ij}\).

Note also that \(\kappa\) is the ratio of individual mutation rates, not the ratio of total transition to transversion rates, which would be \(\kappa/2\), as is used in some parameterisations.

If state_independent is true, then the above is modified by setting \(q_{ii} = \kappa \pi_i\). This makes the model completely state-independent only if \(\kappa = 1\) by adding additional silent mutations. See Existing mutations for more details.

Citation: Hasegawa M, Kishino H, Yano T (1985). “Dating of the human-ape splitting by a molecular clock of mitochondrial DNA”. Journal of Molecular Evolution. 22 (2): 160–74.

Note that this implementation has the root distribution as a separate parameter, although it defaults to the equilibrium distribution. If you set the root distribution to be different than the equilibrium distribution, then you have a nonequilibrium model, and you should make sure that’s what you want.

Parameters:
  • kappa (float) – Scaling parameter by which the transition matrix is modified for transitions (A<>G, C<>T). For example a value of 0 means that no transitions will occur, 1 makes the probability of transitions and transversions equal.

  • equilibrium_frequencies (list[float]) – An array of float values of length 4 which should sum to 1. These values are used to determine equilibrium frequencies (long-term averages) of the alleles. Defaults to [0.25, 0.25, 0.25, 0.25], i.e., all nucleotides equally likely.

  • root_distribution (list[float]) – An array of float values of length 4 which should sum to 1. These values are used to determine the ancestral state of each mutational site. Defaults to the value of equilibrium_frequencies.

  • state_independent (bool) – Whether to include a higher rate of silent mutations that makes the model closer to state-independent.

class msprime.F84[source]#

The F84 mutation model (Felsenstein and Churchill, 1996). Based on the standard ACGT nucleotides as allelic states, this model takes into account transitions and transversions, and sets an equilibrium frequency for each nucleotide. In addition a custom ancestral frequency (root_distribution) can be specified. With kappa=1.0 and the default values of the other arguments this model is equal to JC69. This model is similar to HKY but with a differing parameterisation for kappa.

This model is parameterised by \(\kappa\) (kappa), the ratio of transition to transversion mutation rates, and \(\pi\) (equilibrium_frequencies), the vector of equilibrium nucleotide frequencies. If this mutation model is used with a mutation_rate of \(\mu\), then the mutation rate from the i-th to the j-th allele is equal to \(Q_{ij}\), where

\(\mathbf{Q} = \frac{\mu}{M} \times \begin{bmatrix} \cdot & \pi_C & \left(1 + \frac{\kappa-1}{\pi_A + \pi_G}\right) \pi_G & \pi_T \\ \pi_A & \cdot & \pi_G & \left(1 + \frac{\kappa-1}{\pi_C + \pi_T}\right) \pi_T \\ \left(1 + \frac{\kappa-1}{\pi_A + \pi_G}\right) \pi_A & \pi_C & \cdot & \pi_T \\ \pi_A & \left(1 + \frac{\kappa-1}{\pi_C + \pi_T}\right) \pi_C & \pi_G & \cdot \end{bmatrix}\)

Here \(M\) is an overall scaling factor on the mutation rate that can be computed as follows: let \(q_{ij} = \pi_j\) if i<->j is a transversion, \(q_{ij} = (1 + (\kappa-1)/(\pi_A + \pi_G)) \pi_j\) if i and j are A and G in some order, \(q_{ij} = (1 + (\kappa-1)/(\pi_C + \pi_T)) \pi_j\) if i and j are C and T in some order, and \(q_{ii} = 0\). Let \(M\) be the largest row sum of \(q\), i.e., \(M = \max_i \left( \sum_j q_{ij} \right)\). Then, this implementation as a MatrixMutationModel has the transition matrix \(P_{ij} = q_{ij} / M\), and \(P_{ii} = 1 - \sum_{j\neq i} P_{ij}\).

If state_independent is true, then the above is modified by setting \(q_{ii} = \kappa \pi_i\). This makes the model completely state-independent only if \(\kappa = 1\) by adding additional silent mutations. See Existing mutations for more details.

Citation: Felsenstein J, Churchill GA (January 1996). “A Hidden Markov Model approach to variation among sites in rate of evolution”. Molecular Biology and Evolution. 13 (1): 93–104.

Note that this implementation has the root distribution as a separate parameter, although it defaults to the equilibrium distribution. If you set the root distribution to be different than the equilibrium distribution, then you have a nonequilibrium model, and you should make sure that’s what you want.

Parameters:
  • kappa (float) – Scaling parameter by which the transition matrix is modified for transitions (A<>G, C<>T). Must be greater than or equal to the larger of \((\pi_A+\pi_G)/(\pi_C+\pi_T)\) and \((\pi_C+\pi_T)/(\pi_A+\pi_G)\).

  • equilibrium_frequencies (list[float]) – An array of float values of length 4 which should sum to 1. These values are used to determine equilibrium frequencies (long-term averages) of the alleles. Defaults to [0.25, 0.25, 0.25, 0.25], i.e all nucleotides equally likely.

  • root_distribution (list[float]) – An array of float values of length 4 which should sum to 1. These values are used to determine the ancestral state of each mutational site. Defaults to the value of equilibrium_frequencies.

  • state_independent (bool) – Whether to include a higher rate of silent mutations that makes the model closer to state-independent.

class msprime.GTR[source]#

The Generalised Time-Reversible nucleotide mutation model, a general parameterisation of a time-reversible mutation process (Tavaré et al. 1986). It allows specification of per-nucleotide equilibrium frequencies and equilibrium transition rates.

This model is parameterised by the vector \(r\) (relative_rates), and \(\pi\) (equilibrium_frequencies), the vector of equilibrium nucleotide frequencies. The entries of relative_rates are, in this order, \((r_{AC}, r_{AG}, r_{AT}, r_{CG}, r_{CT}, r_{GT})\). If this mutation model is used with a mutation_rate of \(\mu\), then the mutation rate from the i-th to the j-th allele is \(Q_{ij}\), where

\(\mathbf{Q} = \frac{\mu}{M} \times \begin{bmatrix} \cdot & r_{AC} \pi_C & r_{AG} \pi_G & r_{AT} \pi_T \\ r_{AC} \pi_A & \cdot & r_{CG} \pi_G & r_{CT} \pi_T \\ r_{AG} \pi_A & r_{CG} \pi_C & \cdot & r_{GT} \pi_T \\ r_{AT} \pi_A & r_{CT} \pi_C & r_{GT} \pi_G & \cdot \end{bmatrix}\)

Here \(M\) is an overall scaling factor on the mutation rate that can be computed as follows: let \(q_{ij} = r_{ij} \pi_j\) and \(q_{ii} = 0\), and then let \(M\) be the largest row sum of \(q\), i.e., \(M = \max_i \left( \sum_j q_{ij} \right)\). Then, this implementation as a MatrixMutationModel has the transition matrix \(P_{ij} = q_{ij} / M\), and \(P_{ii}\) chosen so rows sum to one.

If state_independent is true, then the above is modified by setting \(q_{ii} = \pi_i\). This makes the model completely state-independent only if all \(r_{ij} = 1\) by adding additional silent mutations. See Existing mutations for more details.

Citation: Tavaré S (1986). “Some Probabilistic and Statistical Problems in the Analysis of DNA Sequences”. Lectures on Mathematics in the Life Sciences. 17: 57–86.

Note that this implementation has the root distribution as a separate parameter, although it defaults to the equilibrium distribution. If you set the root distribution to be different than the equilibrium distribution, then you have a nonequilibrium model, and you should make sure that’s what you want.

Parameters:
  • relative_rates (list[float]) – Controls the relative rates of mutation between each pair of nucleotides, in order “A<>C”, “A<>G”, “A<>T”, “C<>G”, “C<>T”, and “G<>T”.

  • equilibrium_frequencies (list[float]) – The equilibrium base frequencies, a list of four probabilities that sum to one. (Default: equal frequencies.)

  • root_distribution (list[float]) – An array of float values of length 4 which should sum to 1. These values are used to determine the ancestral state of each mutational site. Defaults to the value of equilibrium_frequencies.

  • state_independent (bool) – Whether to include a higher rate of silent mutations that makes the model closer to state-independent.

class msprime.BLOSUM62[source]#

The BLOSUM62 model of time-reversible amino acid mutation. This model has no free parameters.

The model is parameterised by a 20-by-20 symmetric matrix of relative rates, \(B\), and a vector of amino acid equilibrium frequencies, \(\pi\) (for the precise order, see this model’s .alleles attribute). If this mutation model is used with a mutation_rate of \(\mu\), then the mutation rate from the i-th to the j-th allele is \(Q_{ij}\), where

\(Q_{ij} = \frac{\mu}{M} B_{ij} \pi_j,\)

where \(M\) is an overall scaling factor chosen so that the largest row sum in the matrix \(B_{ij} \pi_j /M\) is equal to one. In this model, \(M = 1.726203705809619\). This implementation as a MatrixMutationModel has transition matrix \(P_{ij} = B_{ij} \pi_{j} / M\), and \(P_{ii}\) chosen so rows sum to one, and root distribution equal to the equilibrium frequencies, \(\pi\), so the matrix \(B\) can be recovered as follows:

model = msprime.BLOSUM62()
M = 1.726203705809619
B = M * model.transition_matrix / model.root_distribution

Citation: The values of \(B\) and \(\pi\) used here were copied from Seq-Gen. The original citation is: Henikoff, S., and J. G. Henikoff. 1992. PNAS USA 89:10915-10919, and the numerical values were provided in: Yu,Y.-K., Wootton, J.C. and Altschul, S.F. (2003) The compositional adjustment of amino acid substitution matrices. Proc. Natl Acad. Sci., USA, 100, 15688–15693.

class msprime.PAM[source]#

The PAM model of time-reversible amino acid mutation. This model has no free parameters.

The model is parameterised by a 20-by-20 symmetric matrix of relative rates, \(B\), and a vector of amino acid equilibrium frequencies, \(\pi\) (for the precise order, see this model’s .alleles attribute). If this mutation model is used with a mutation_rate of \(\mu\), then the mutation rate from the i-th to the j-th allele is \(Q_{ij}\), where

\(Q_{ij} = \frac{\mu}{M} B_{ij} \pi_j,\)

where \(M\) is an overall scaling factor chosen so that the largest row sum in the matrix \(B_{ij} \pi_j /M\) is equal to one. In this model, \(M = 1.78314862248\). This implementation as a MatrixMutationModel has transition matrix \(P_{ij} = B_{ij} \pi_{j} / M\), and \(P_{ii}\) chosen so rows sum to one, and root distribution equal to the equilibrium frequencies, \(\pi\), so the matrix \(B\) can be recovered as follows:

model = msprime.PAM()
M = 1.78314862248
B = M * model.transition_matrix / model.root_distribution

Citation: The values of \(B\) and \(\pi\) used here were copied from Seq-Gen, and follow the “Dayhoff DCMut” model as described in Kosiol, C., and Goldman, N. 2005. Different versions of the Dayhoff rate matrix. Molecular Biology and Evolution 22:193-199. The original citation is: Dayhoff, M.O., Schwartz, R.M., Orcutt, B.C. (1978). A model of evolutionary change in proteins. Atlas of Protein Sequence Structures, Vol. 5, Suppl. 3, National Biomedical Research Foundation, Washington DC, pp. 345-352.

class msprime.InfiniteAlleles[source]#

An infinite alleles model of mutation in which each allele is a unique positive integer. This works by keeping track of a “next allele”: each time the model is asked to produce a new allele (either for an ancestral or derived state), the “next allele” is provided, and then incremented.

Variables:

next_allele: The next allele to be assigned. This increments by one each time an allele is assigned to a new mutation, and resets to start_allele each time a new ancestral state is assigned.

Parameters:

start_allele (int) – The nonnegative integer to start assigning alleles from. (default: 0)

class msprime.SLiMMutationModel[source]#

An infinite-alleles model of mutation producing “SLiM-style” mutations.

To agree with mutations produced by SLiM, the ancestral state of each new site is set to the empty string, and each derived state is produced by appending the “next allele” to the previous state. The result is that each allele is a comma-separated string of all mutations that have occurred up to the root.

Mutations produced by SLiM carry both a time attribute, in units of “time ago” as usual for tskit, as well as a metadata attributed called “origin_generation”, in units of time since the start of the simulation. Adding these two together - time since the start of the simulation plus time until the end - is equal to the total number of generations of the simulation. The origin_generation is not currently used by SLiM, but for consistency, the origin_generation attribute for mutations produced by this model is set equal to slim_generation minus floor(mut.time), where mut.time is the (tskit) time ago of the mutation.

Parameters:
  • type (int) – The nonnegative integer defining the “type” of SLiM mutation that will be recorded in metadata.

  • next_id (int) – The nonnegative integer to start assigning alleles from. (default: 0)

  • slim_generation (int) – The “SLiM generation” time used in determining the “origin_generation” metadata attribute of mutations. This can usually be left at its default, which is 1.

  • block_size (int) – The block size for allocating derived states. You do not need to change this unless you get an “out of memory” error due to a very large number of stacked mutations.

Node flags#

In the tskit Node Table node flags specify particular properties about nodes. Msprime follows the standard approach of setting the tskit.NODE_IS_SAMPLE flag for all sample nodes, with all other nodes having a flags value of 0.

Msprime defines some extra flags that help us to identify particular nodes in some situations:

Note

These flags have been deprecated in favour of using NodeType. This is not a breaking change. The constant associated with each flag remains the same. However, working with flags should be more intuitive now that we are relying on the enum.Flag functionality.

msprime.NODE_IS_RE_EVENT#

The node is an ARG recombination event. Each recombination event is marked with two nodes, one identifying the individual providing the genetic material to the left of the breakpoint and the other providing the genetic material the right. Present either with the record_full_arg flag or when adding recombination to the additional_nodes to be stored.

msprime.NODE_IS_CA_EVENT#

The node is an ARG common ancestor event that did not result in marginal coalescence. Present either with the record_full_arg flag or when adding common ancestor events to the additional_nodes to be stored.

msprime.NODE_IS_MIG_EVENT#

The node is an ARG migration event identifying the individual that migrated. Can be used in combination with the record_migrations option. Present either with the record_full_arg flag or when adding migration events to the additional_nodes to be stored.

msprime.NODE_IS_CEN_EVENT#

The node was created by a census event. Please see the Census events section for more details.

msprime.NODE_IS_GC_EVENT#

The node was created by a gene conversion event.

msprime.NODE_IS_PASS_THROUGH#

The node identifies an ancestral genome/ploid through which the ancestral material of only a single lineage passed (so no coalescence or common ancestor event). Can be used in combination with the record_migrations option. Only present when storing pass through events using the additional_nodes option. And only compatible with DTWF and FixedPedigree models.

Rate maps#

class msprime.RateMap(*, position, rate)[source]#

A class mapping a non-negative rate value to a set of non-overlapping intervals along the genome. Intervals for which the rate is unknown (i.e., missing data) are encoded by NaN values in the rate array.

Parameters:
  • position (list) – A list of \(n+1\) positions, starting at 0, and ending in the sequence length over which the RateMap will apply.

  • rate (list) – A list of \(n\) positive rates that apply between each position. Intervals with missing data are encoded by NaN values.

property left#

The left position of each interval (inclusive).

property right#

The right position of each interval (exclusive).

property mid#

Returns the midpoint of each interval.

property span#

Returns the span (i.e., right - left) of each of the intervals.

property position#

The breakpoint positions between intervals. This is equal to the left array with the sequence_length appended.

property rate#

The rate associated with each interval. Missing data is encoded by NaN values.

property mass#

The “mass” of each interval, defined as the rate \(\times\) span. This is NaN for intervals containing missing data.

property missing#

A boolean array encoding whether each interval contains missing data. Equivalent to np.isnan(rate_map.rate)

property non_missing#

A boolean array encoding whether each interval contains non-missing data. Equivalent to np.logical_not(np.isnan(rate_map.rate))

property num_intervals#

The total number of intervals in this map. Equal to num_missing_intervals + num_non_missing_intervals.

property num_missing_intervals#

Returns the number of missing intervals, i.e., those in which the rate value is a NaN.

property num_non_missing_intervals#

The number of non missing intervals, i.e., those in which the rate value is not a NaN.

property sequence_length#

The sequence length covered by this map

property total_mass#

The cumulative total mass over the entire map.

property mean_rate#

The mean rate over this map weighted by the span covered by each rate. Unknown intervals are excluded.

get_rate(x)[source]#

Return the rate at the specified list of positions.

Note

This function will return a NaN value for any positions that contain missing data.

Parameters:

x (numpy.ndarray) – The positions for which to return values.

Returns:

An array of rates, the same length as x.

Return type:

numpy.ndarray

get_cumulative_mass(x)[source]#

Return the cumulative mass of the map up to (but not including) a given point for a list of positions along the map. This is equal to the integral of the rate from 0 to the point.

Parameters:

x (numpy.ndarray) – The positions for which to return values.

Returns:

An array of cumulative mass values, the same length as x

Return type:

numpy.ndarray

find_index(x)[source]#

Returns the index of the interval that the specified position falls within, such that rate_map.left[index] <= x < self.rate_map.right[index].

Parameters:

x (float) – The position to search.

Returns:

The index of the interval containing this point.

Return type:

int

Raises:

KeyError if the position is not contained in any of the intervals.

missing_intervals()[source]#

Returns the left and right coordinates of the intervals containing missing data in this map as a 2D numpy array with shape (num_missing_intervals, 2). Each row of this returned array is therefore a left, right tuple corresponding to the coordinates of the missing intervals.

Returns:

A numpy array of the coordinates of intervals containing missing data.

Return type:

numpy.ndarray

copy()[source]#

Returns a deep copy of this RateMap.

slice(left=None, right=None, *, trim=False)[source]#

Returns a subset of this rate map in the specified interval.

Parameters:
  • left (float) – The left coordinate (inclusive) of the region to keep. If None, defaults to 0.

  • right (float) – The right coordinate (exclusive) of the region to keep. If None, defaults to the sequence length.

  • trim (bool) – If True, remove the flanking regions such that the sequence length of the new rate map is right - left. If False (default), do not change the coordinate system and mark the flanking regions as “unknown”.

Returns:

A new RateMap instance

Return type:

RateMap

static uniform(sequence_length, rate)[source]#

Create a uniform rate map

static read_hapmap(fileobj, sequence_length=None, *, has_header=True, position_col=None, rate_col=None, map_col=None)[source]#

Parses the specified file in HapMap format and returns a RateMap. HapMap files must white-space-delimited, and by default are assumed to contain a single header line (which is ignored). Each subsequent line then contains a physical position (in base pairs) and either a genetic map position (in centiMorgans) or a recombination rate (in centiMorgans per megabase). The value in the rate column in a given line gives the constant rate between the physical position in that line (inclusive) and the physical position on the next line (exclusive). By default, the second column of the file is taken as the physical position and the fourth column is taken as the genetic position, as seen in the following sample of the format:

Chromosome  Position(bp)  Rate(cM/Mb)  Map(cM)
chr10       48232         0.1614       0.002664
chr10       48486         0.1589       0.002705
chr10       50009         0.159        0.002947
chr10       52147         0.1574       0.003287
...
chr10       133762002     3.358        181.129345
chr10       133766368     0.000        181.144008

In the example above, the first row has a nonzero genetic map position (last column, cM), implying a nonzero recombination rate before that position, that is assumed to extend to the start of the chromosome (at position 0 bp). However, if the first line has a nonzero bp position (second column) and a zero genetic map position (last column, cM), then the recombination rate before that position is unknown, producing missing data.

Note

The rows are all assumed to come from the same contig, and the first column is currently ignored. Therefore if you have a single file containing several contigs or chromosomes, you must must split it up into multiple files, and pass each one separately to this function.

Parameters:
  • fileobj (str) – Filename or file to read. This is passed directly to numpy.loadtxt(), so if the filename extension is .gz or .bz2, the file is decompressed first

  • sequence_length (float) – The total length of the map. If None, then assume it is the last physical position listed in the file. Otherwise it must be greater then or equal to the last physical position in the file, and the region between the last physical position and the sequence_length is padded with a rate of zero.

  • has_header (bool) – If True (default), assume the file has a header row and ignore the first line of the file.

  • position_col (int) – The zero-based index of the column in the file specifying the physical position in base pairs. If None (default) assume an index of 1 (i.e. the second column).

  • rate_col (int) – The zero-based index of the column in the file specifying the rate in cM/Mb. If None (default) do not use the rate column, but calculate rates using the genetic map positions, as specified in map_col. If the rate column is used, the interval from 0 to first physical position in the file is marked as unknown, and the last value in the rate column must be zero.

  • map_col (int) – The zero-based index of the column in the file specifying the genetic map position in centiMorgans. If None (default), assume an index of 3 (i.e. the fourth column). If the first genetic position is 0 the interval from position 0 to the first physical position in the file is marked as unknown. Otherwise, act as if an additional row, specifying physical position 0 and genetic position 0, exists at the start of the file.

Returns:

A RateMap object.

Return type:

RateMap

Demography#

class msprime.Demography[source]#

The definition of a demographic model for an msprime simulation, consisting of a set of populations, a migration matrix, and a list of demographic events. See the Demographic models section for detailed documentation on how to define, debug and simulate with demography in msprime.

Please see the Demography objects section for details of how to access and update population information within a model.

Demography objects implement the Python collections.abc.Mapping protocol, in which the keys are either the population name or integer id values (see the Identifying populations section for more information) and the values are Population objects.

In general, population references in methods such as Demography.add_population_split() can either be string names or integer IDs, and the two forms can be used interchangeably.

add_population(*, initial_size, growth_rate=None, name=None, description=None, extra_metadata=None, default_sampling_time=None, initially_active=None)[source]#

Adds a new Population to this Demography with the specified parameters. The new population will have ID equal to the the number of populations immediately before add_population is called, such that the first population added has ID 0, the next ID 1 and so on. If the name is not specified, this defaults to "pop_{id}". An Initial size value must be specified (but may be zero).

Parameters:
  • initial_size (float) – The number of individuals of the population at time zero. See the Initial size section for more details and examples.

  • growth_rate (float) – The exponential growth rate of the population. See the Growth rate section for more details and examples.

  • name (str) – The human-readable identifier for this population. If not specified, defaults to the string "pop_{id}" where id is the population’s integer ID. See Identifying populations for more details and recommendations on best practise.

  • description (str) – A concise but informative description of what this population represents within the wider model. Defaults to the empty strings.

  • extra_metadata (dict) – Extra metadata to associate with this population that will be stored tree sequences output by sim_ancestry(). See the Metadata section for more details and examples.

  • default_sampling_time (float) – The time at which samples will be taken from this population, if a time in not otherwise specified. By default this is determined by the details of the model, and whether populations are ancestral in Population split events. See the Default sampling time section for more details.

  • initially_active (bool) – Whether this population is initially active. By default this is determined by the details of the model, and whether populations are ancestral in Population split events. See the Life cycle section for more details.

Returns:

The new Population instance.

Return type:

Population

set_migration_rate(source, dest, rate)[source]#

Sets the backwards-time rate of migration from the specified source population to dest to the specified value. This has the effect of setting demography.migration_matrix[source, dest] = rate. It is the rate at which a lineage currently in source moves to dest as one follows the lineage back through time.

Important

Note this is the backwards in time; migration rate and that source and dest are from the perspective of lineages in the coalescent process. See Migration for more details and clarification on this vital point.

The source and dest populations can be referred to either by their integer id or string name values.

Parameters:
  • source (str,int) – The source population from which lineages originate in the backwards-time process.

  • dest (str,int) – The destination population where lineages are move to in the backwards-time process.

  • rate (float) – The per-generation migration rate.

set_symmetric_migration_rate(populations, rate)[source]#

Sets the symmetric migration rate between all pairs of populations in the specified list to the specified value. For a given pair of population IDs j and k, this sets demography.migration_matrix[j, k] = rate and demography.migration_matrix[k, j] = rate.

Populations may be specified either by their integer IDs or by their string names.

Parameters:
  • populations (list) – An iterable of population identifiers (integer IDs or string names).

  • rate (float) – The value to set the migration matrix entries to.

add_population_split(time, *, derived, ancestral)[source]#

Adds a population split event at the specified time. In a population split event all lineages from the (more recent) derived populations move to the (more ancient) ancestral population. Forwards in time, this corresponds to the ancestral population splitting into the derived populations.

See the Population split section for more details and examples.

In addition to moving lineages from the derived population(s) into the ancestral population, a population split has the following additional effects:

  • All derived populations are set to inactive.

  • All migration rates to and from the derived populations are set to 0.

  • Population sizes and growth rates for the derived populations are set to 0.

  • The default_sampling_time of the ancestral Population is set to the time of this event, if the default_sampling_time for the ancestral population has not already been set.

Parameters:
  • time (float) – The time at which this event occurs in generations.

  • derived (list(str, int)) – The derived populations.

  • ancestral (str, int) – The ancestral population.

add_admixture(time, *, derived, ancestral, proportions)[source]#

Adds an admixture event at the specified time. In an admixture event all lineages from a (more recent) derived population move to a list of (more ancient) ancestral populations according to a list of proportions, such that a given lineage has a probability proportions[j] of being moved to the population ancestral[j]. This movement of lineages backwards in time corresponds to the initial state of the admixed derived population the specified time being composed of individuals from the specified ancestral populations in the specified proportions.

See the Admixture section for more details and examples.

In addition to moving lineages from the derived population into the ancestral population(s), an admixture has the following additional effects:

  • The derived population is set to inactive.

  • The ancestral populations are set to active, if they are not already active.

  • All migration rates to and from the derived population are set to 0.

  • Population sizes and growth rates for the derived population are set to 0, and the population is marked as inactive.

Parameters:
  • time (float) – The time at which this event occurs in generations.

  • derived (str, int) – The derived population.

  • ancestral (list(str, int)) – The ancestral populations.

  • proportions (list(float)) – The proportion of the derived population from each of the ancestral populations at the time of the event.

add_mass_migration(time, *, source, dest, proportion)[source]#

Adds a mass migration (or “pulse migration”) event at the specified time. In a mass migration event, lineages in the source population are moved to the dest population with probability proportion. Forwards-in-time, this corresponds to individuals migrating from population dest to population source.

Please see the Pulse (mass) migration section for more details and examples.

Warning

Mass migrations are an advanced feature and should only be used if the required population dynamics cannot be modelled by Population split or Admixture events.

Important

Note that source and dest are from the perspective of the coalescent process, i.e. backwards in time; please see the Migration section for more details.

Parameters:
  • time (float) – The time at which this event occurs in generations.

  • source (str, int) – The population from which lineages are moved.

  • dest (str, int) – The population to which lineages are moved.

  • proportion (float) – For each lineage in the source population, this is the probability that it moves to the dest population.

add_migration_rate_change(time, *, rate, source=None, dest=None)[source]#

Changes the rate of migration from one deme to another to a new value at a specific time. Migration rates are specified in terms of the rate at which lineages move from population source to dest during the progress of the simulation.

Important

Note that source and dest are from the perspective of the coalescent process, i.e. backwards in time; please see the Migration section for more details.

By default, source=None and dest=None, which results in all non-diagonal elements of the migration matrix being changed to the new rate. If source and dest are specified, they must refer to valid populations (either integer IDs or string names).

Parameters:
  • time (float) – The time at which this event occurs in generations.

  • rate (float) – The new per-generation migration rate.

  • source (int) – The ID of the source population.

  • dest (str, int) – The ID of the destination population.

  • source – The source population ID.

add_symmetric_migration_rate_change(time, populations, rate)[source]#

Sets the symmetric migration rate between all pairs of populations in the specified list to the specified value. For a given pair of population IDs j and k, this sets migration_matrix[j, k] = rate and migration_matrix[k, j] = rate.

Please see the Migration section for more details.

Populations may be specified either by their integer IDs or by their string names.

Parameters:
  • time (float) – The time at which this event occurs in generations.

  • populations (list) – An sequence of population identifiers (integer IDs or string names).

  • rate (float) – The new migration rate.

add_population_parameters_change(time, *, initial_size=None, growth_rate=None, population=None)[source]#

Changes the size parameters of a population (or all populations) at a given time.

Please see the Populations section for more details.

Parameters:
  • time (float) – The length of time ago at which this event occurred.

  • initial_size (float) – The number of individuals in the population at the beginning of the time slice starting at time. If None, the initial_size of the population is computed according to the initial population size and growth rate over the preceding time slice.

  • growth_rate (float) – The new per-generation growth rate. If None, the growth rate is not changed. Defaults to None.

  • population (str, int) – The ID of the population affected. If population is None, the changes affect all populations simultaneously.

add_simple_bottleneck(time, population, proportion=None)[source]#

Adds a population bottleneck at the specified time in which each lineage has probability equal to proportion of coalescing into a single ancestor.

Please see the Simple bottleneck section for more details.

Parameters:
  • time (float) – The length of time ago at which this event occurred.

  • population (str, int) – The ID of the population affected.

  • proportion (float) – The probability of each lineage coalescing into a single ancestor. Defaults to 1.0.

add_instantaneous_bottleneck(time, *, population, strength)[source]#

Adds a bottleneck at the specified time in the specified population that is equivalent to the coalescent process running for strength generations.

Please see the Instantaneous bottleneck section for more details.

Note

The ploidy is also use to scale the time scale of the coalescent process during the bottleneck.

Parameters:
  • time (float) – The length of time ago at which this event occurred.

  • population (str, int) – The ID of the population affected.

  • strength (float) – The equivalent amount of time in the standard coalescent.

add_census(time)[source]#

Adds a “census” event at the specified time. In a census we add a node to each branch of every tree, thus recording the population that each lineage is in at the specified time.

This may be used to record all ancestral haplotypes present at that time, and to extract other information related to these haplotypes: for instance to trace the local ancestry of a sample back to a set of contemporaneous ancestors, or to assess whether a subset of samples has coalesced more recently than the census time. (However, these added nodes will only represent the portions of ancestral genomes inherited by the samples, rather than complete ancestral genomes.)

See Census events for more details.

Warning

When used in the conjunction with the DTWF model non-integer census times should be used to guarantee that the census nodes don’t coincide with coalescences (and therefore zero branch length errors). See Using the DTWF model for more details.

Parameters:

time (float) – The time at which the census should occur.

validate()[source]#

Checks the demography looks sensible and raises errors/warnings appropriately, and return a copy in which all default values have been appropriately resolved.

copy(populations=None)[source]#

Returns a copy of this model. If the populations argument is specified, the populations in the copied model will be in this order.

Parameters:

populations (list) – A list of population identifiers defining the order of the populations in the new model. If not specified, the current order is used.

Returns:

A copy of this Demography.

debug()[source]#

Returns a DemographyDebugger instance for this demography.

Returns:

A DemographyDebugger object for this demography.

Return type:

msprime.DemographyDebugger

assert_equal(other)[source]#

Compares this Demography with specified other and raises an AssertionError if they are not exactly equal.

Parameters:

other (Demography) – The other demography to compare against.

is_equivalent(other, rel_tol=None, abs_tol=None)[source]#

Compares this demography with the other and return True if they are equivalent up to the specified numerical tolerances. Two demographies are equivalent if, they have the same set of epochs defined by demographic events, and for each epoch:

  • The population’s initial_size, growth_rate and active values are equal in all populations.

  • The migration matrices are equal

  • The same sequence of lineage movements through population splits, etc.

All numerical comparisons are performed using math.isclose().

Parameters:
  • other (Demography) – The other demography to compare against.

  • rel_tol (float) – The relative tolerance used by math.isclose.

  • abs_tol (float) – The relative tolerance used by math.isclose.

Returns:

True if this demography and other are equivalent up to numerical tolerances.

Rtype bool:

bool

static from_species_tree(tree, initial_size, *, time_units='gen', generation_time=None, growth_rate=None)[source]#

Parse a species tree in Newick format and return the corresponding Demography object. The tree is assumed to be rooted and ultrametric and branch lengths must be included and correspond to time, either in units of millions of years, years, or generations.

The returned Demography object contains a Population for each node in the species tree. The population’s name attribute will be either the corresponding node label from the newick tree, if it exists, or otherwise the name takes the form “pop_{j}”, where j is the position of the given population in the list. Leaf populations are first in the list, and added in left-to-right order. Populations corresponding to the internal nodes are then added in a postorder traversal of the species tree. For each internal node a Population split event is added so that lineages move from its child populations at the appropriate time and rates of continuous migration to and from the child populations is set to zero. See the Population split section for more details.

The initial sizes and growth rates for the populations in the model are set via the initial_size and growth_rate arguments. These can be specified in two ways: if a single number is provided, this is used for all populations. The argument may also be a mapping from population names to their respective values. For example:

tree = "(A:10.0,B:10.0)C"
initial_size = {"A": 1000, "B": 2000, "C": 100}
demography = msprime.Demography.from_species_tree(tree, initial_size)

Note that it is possible to have default population sizes for unnamed ancestral populations using a collections.defaultdict, e.g.,

tree = "(A:10.0,B:10.0)"
initial_size = collections.defaultdict(lambda: 100)
initial_size.update({"A": 1000, "B": 2000})
demography = msprime.Demography.from_species_tree(tree, initial_size)
Parameters:
  • tree (str) – The tree string in Newick format, with named leaves and branch lengths.

  • initial_size – Each population’s initial_size, in numbers of individuals. May be a single number or a mapping from population names to their sizes.

  • growth_rate – Each population’s growth_rate. May be a single number or a mapping from population names to their exponential growth rates. Defaults to zero.

  • time_units (str) – The units of time in which the species tree’s branch lengths are measured. Allowed branch length units are millions of years, years, and generations; these should be specified with the strings "myr", "yr", or "gen", respectively. This defaults to "gen".

  • generation_time (float) – The number of years per generation. If and only if the branch lengths are not in units of generations, the generation time must be specified. This defaults to None.

Returns:

A Demography object representing the specified species tree.

Return type:

msprime.Demography

static from_starbeast(tree, generation_time, time_units='myr')[source]#

Parse a species tree produced by the program TreeAnnotator based on a posterior tree distribution generated with StarBEAST and return the corresponding Demography object.

Species trees produced by TreeAnnotator are written in Nexus format and are rooted, bifurcating, and ultrametric. Branch lengths usually are in units of millions of years, but the use of other units is permitted by StarBEAST (and thus TreeAnnotator). This function allows branch length units of millions of years or years. Leaves must be named and the tree must include information on population sizes of leaf and ancestral species in the form of annotation with the “dmv” tag, which is the case for trees written by TreeAnnotator based on StarBEAST posterior tree distributions.

The returned Demography object contains a Population for each node in the species tree. The population’s name attribute will be either the corresponding node label from the newick tree, if it exists, or otherwise the name takes the form “pop_{j}”, where j is the position of the given population in the list. Leaf populations are first in the list, and added in left-to-right order. Populations corresponding to the internal nodes are then added in a postorder traversal of the species tree. For each internal node a Population split event is added so that lineages move from its child populations at the appropriate time and rates of continuous migration to and from the child populations is set to zero. See the Population split section for more details.

Parameters:
  • tree (str) – The tree string in Nexus format, with named leaves, branch lengths, and branch annotation. Typically, this string is the entire content of a file written by TreeAnnotator.

  • generation_time (float) – The number of years per generation.

  • time_units (str) – The units of time in which the species tree’s branch lengths are measured. Allowed branch length units are millions of years, and years; these should be specified with the strings "myr" or "yr", respectively. This defaults to "myr".

Returns:

A Demography instance that describing the information in the specified species tree.

Return type:

msprime.Demography

static from_old_style(population_configurations=None, *, migration_matrix=None, demographic_events=None, Ne=1, ignore_sample_size=False, population_map=None)[source]#

Creates a Demography object from the pre 1.0 style input parameters, reproducing the old semantics with respect to default values.

No sample information is stored in the new-style Demography objects, and therefore if the sample_size attribute of any of the input PopulationConfiguration objects is set a ValueError will be raised by default. However, if the ignore_sample_size parameter is set to True, this check will not be performed and the sample sizes specified in the old-style PopulationConfiguration objects will be ignored.

Each PopulationConfiguration instance in the list of population_configurations corresponds to the equivalent Population object in the returned Demography. If a PopulationConfiguration has metadata defined and this dictionary contains a name field, this will be used as the Population name. Otherwise, the default population names will be used.

Please see the Specifying samples section for details on how to specify sample locations in sim_ancestry().

Todo

Document the remaining parameters.

static from_demes(graph)[source]#

Creates a Demography object from the specified demes graph. Time values in the demes graph may be specified in any units, but the returned object has units converted into generations. See the Demes section for details.

import demes

graph = demes.load("model.yaml")
demography = msprime.Demography.from_demes(graph)
Parameters:

graph (demes.Graph) – A demes graph.

Returns:

A Demography instance corresponding to the demes model.

Return type:

msprime.Demography

static from_tree_sequence(ts, initial_size=0)[source]#

Creates a Demography object based on the information in the specified tskit.TreeSequence. The returned demography will contain a population for each of the populations in the tree sequence, in the same order.

The metadata for each population in the tree sequence will be inspected. If a schema is present and the metadata can be decoded, the name and description properties of populations are set if the corresponding keys are present.

If the metadata cannot be decoded, the default values for name and description are used.

The initial_size of each of the new populations is set to zero by default, and all other Population attributes are set to their default values. It is therefore essential to update the initial_size and growth_rate values to reflect the desired demography.

See also

See the initial state section for examples of how this method can be used.

Parameters:
  • ts (tskit.TreeSequence) – The tree sequence to extract population information from.

  • initial_size (float) – The default initial size for the newly added populations, as a number of individuals (Default=0).

Returns:

A Demography object representing the populations in the specified tree sequence.

Return type:

msprime.Demography

static isolated_model(initial_size, *, growth_rate=None)[source]#

Returns a Demography object representing a collection of isolated populations with specified initial population sizes and growth rates. Please see Demographic models for more details on population sizes and growth rates.

Parameters:
  • initial_size (list) – the initial_size value for each of the Population in the returned model. The length of the array corresponds to the number of populations. model.

  • growth_rate (list) – The exponential growth rate for each population. Must be either None (the default, resulting a zero growth rate) or an array with the same length as initial_size.

Returns:

A Demography object representing this model, suitable as input to sim_ancestry().

Return type:

msprime.Demography

static island_model(initial_size, migration_rate, *, growth_rate=None)[source]#

Returns a Demography object representing a collection of populations with specified initial population sizes and growth rates, with symmetric migration between each pair of populations at the specified rate. Please see Demographic models for more details on population sizes and growth rates.

Parameters:
  • initial_size (list) – the initial_size value for each of the Population in the returned model. The length of the array corresponds to the number of populations. model.

  • migration_rate (float) – The migration rate between each pair of populations.

  • growth_rate (list) – The exponential growth rate for each population. Must be either None (the default, resulting a zero growth rate) or an array with the same length as initial_size.

Returns:

A Demography object representing this model, suitable as input to sim_ancestry().

Return type:

msprime.Demography

static stepping_stone_model(initial_size, migration_rate, *, growth_rate=None, boundaries=False)[source]#

Returns a Demography object representing a collection of populations with specified initial population sizes and growth rates, in which adjacent demes exchange migrants at the specified rate. Please see Demographic models for more details on population sizes and growth rates.

Note

The current implementation only supports a one-dimensional stepping stone model, but higher dimensions could also be supported. Please open an issue on GitHub if this feature would be useful to you.

Parameters:
  • initial_size (list) – the initial_size value for each of the Population in the returned model. The length of the array corresponds to the number of populations.

  • migration_rate (float) – The migration rate between adjacent pairs of populations.

  • growth_rate (list) – The exponential growth rate for each population. Must be either None (the default, resulting a zero growth rate) or an array with the same length as initial_size.

  • boundaries (bool) – If True the stepping stone model has boundary conditions imposed so that demes at either end of the chain do not exchange migrants. If False (the default), the set of populations is “circular” and migration takes place between the terminal demes.

Returns:

A Demography object representing this model, suitable as input to sim_ancestry().

Return type:

msprime.Demography

to_demes()[source]#

Creates a demes.Graph object from the demography. See the Demes section for details.

Note

Demographic models using bottlenecks added via the add_simple_bottleneck() or add_instantaneous_bottleneck() methods are not able to be converted into a demes graph.

Note

Demes is stricter than msprime with regard to how a demographic model is structured, so models that can be simulated with msprime are not guaranteed to be convertible to a demes graph. In particular, msprime’s legacy API permits setting migrations or other attributes for a population even after that population has been merged into an ancestor. Such models are rarely constructed deliberately, so an error during conversion of a legacy model could indicate model misspecification.

The returned graph can be saved as a Demes-format YAML file using the demes API.

import demes

demography = msprime.Demography.island_model([1000] * 3, 1e-5)
graph = demography.to_demes()
demes.dump(graph, "island_model.yaml")

Or plotted using the demesdraw visualisation package.

import demesdraw

demography = msprime.Demography.island_model([1000] * 3, 1e-5)
graph = demography.to_demes()
ax = demesdraw.tubes(graph)
ax.figure.savefig("island_model.pdf")
Returns:

A demes.Graph object corresponding to the demography.

Return type:

demes.Graph

class msprime.Population(initial_size=0.0, growth_rate=0.0, name=None, description='', extra_metadata=<factory>, default_sampling_time=None, initially_active=None, id=None)[source]#

A single population in a Demography. See the Populations section for more information on what populations represent, and how they can be used.

Warning

This class should not be instantiated directly. Please use Demography.add_population() method instead.

initial_size = 0.0#

The absolute size of the population at time zero. See the Initial size section for more details and examples.

growth_rate = 0.0#

The exponential growth rate of the population per generation (forwards in time). Growth rates can be negative. This is zero for a constant population size, and positive for a population that has been growing. See the Growth rate section for more details and examples.

name = None#

The name of the population. If specified this must be a uniquely identifying string and must be a valid Python identifier (i.e., could be used as a variable name in Python code). See Identifying populations for more details and recommendations on best practise.

description = ''#

A concise description of the population. Defaults to the empty string if not specified.

extra_metadata#

A JSON-encodable dictionary of metadata items to be stored in the associated tskit population object. This dictionary must not contain keys for any of the pre-defined metadata items. See the Metadata section for more details and examples.

default_sampling_time = None#

The default time at which samples are drawn from this population. See the Default sampling time section for more details.

initially_active = None#

If True, this population will always be initially active, regardless of whether it participates in a Population split. If not set, or None, the initial state of the population will be set automatically depending on the events declared in the demography. See the Life cycle section for more details.

id = None#

The integer ID of this population within the parent Demography. This attribute is assigned by the Demography class and should not be set or changed by user code.

class msprime.DemographyDebugger(Ne=1, population_configurations=None, migration_matrix=None, demographic_events=None, model=None, *, demography=None)[source]#

Utilities to compute and display information about the state of populations during the different simulation epochs defined by demographic events.

Warning

This class is not intended to be instantiated directly using the constructor - please use Demography.debug() to obtain a DemographyDebugger for a given Demography instead.

print_history(output=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>)[source]#

Prints a summary of the history of the populations.

Deprecated since 1.0: use print(debugger) instead.

property population_size_history#

Returns a (num_pops, num_epochs) numpy array giving the starting population size for each population in each epoch.

property epoch_start_time#

The array of epoch start_times defined by the demographic model.

property num_epochs#

Returns the number of epochs defined by the demographic model.

population_size_trajectory(steps)[source]#

Return an array of per-population population sizes, as defined by the demographic model. These are the initial_size parameters of the model, modified by any population growth rates. The sizes are computed at the time points given by steps.

Parameters:

steps (list) – List of times ago at which the population size will be computed.

Returns:

Returns a numpy array of population sizes, with one column per population, whose [i,j]th entry is the size of population j at time steps[i] ago.

lineage_probabilities(steps, sample_time=0)[source]#

Returns an array such that P[j, a, b] is the probability that a lineage that started in population a at time sample_time is in population b at time steps[j] ago.

This function reports sampling probabilities _before_ mass migration events (or other events that move lineages) at a step time, if a mass migration event occurs at one of those times. Migrations will then effect the next time step.

Parameters:
  • steps (list) – A list of times to compute probabilities.

  • sample_time – The time of sampling of the lineage. For any times in steps that are more recent than sample_time, the probability of finding the lineage in any population is zero.

Returns:

An array of dimension len(steps) by num pops by num_pops.

possible_lineage_locations(samples=None)[source]#

Given the sampling configuration, this function determines when lineages are possibly found within each population over epochs defined by demographic events and sampling times. If no sampling configuration is given, we assume we sample lineages from every population at time zero.

The epoch intervals returned are those in which there are distinct configurations of possible lineage locations, and so the number of returned epochs may be less than the total number of epochs defined by the demography and will depend on the input sample configuration.

The samples are specified by either a list of population identifiers ( integer IDs or string names) or by a list of SampleSet objects, allowing sampling times to be specified explicitly. If the time field of the SampleSet is not specified (or population IDs are used) samples are taken at the population’s default_sampling_time. Only SampleSet objects with num_samples > 0 are counted as contributing samples to a particular population.

To support legacy code, Sample objects from the 0.x API can also provided, although its use is discouraged in new code.

Parameters:

samples (list) – The populations that we sample from. Can be either a list of population identifiers, SampleSet or Sample objects.

Returns:

Returns a dictionary with epoch intervals as keys whose values are a list with length equal to the number of populations with True and False indicating which populations could possibly contain lineages over that epoch. The epoch intervals are given by tuples: (epoch start, epoch end). The first epoch necessarily starts at time 0, and the final epoch has end time of infinity.

mean_coalescence_time(lineages, min_pop_size=1, steps=None, rtol=0.005, max_iter=12)[source]#

Compute the mean time until coalescence between pairs of the specified sample lineages. Sample lineages are specified as a mapping from populations to the number of monoploid sample genomes present in that population at time zero. See the Coalescence rates and mean times section for usage examples and more details.

Important

This function assumes a diploid model when computing coalescence rates (see the Coalescent time scales section for more information).

The calculation is performed by using coalescence_rate_trajectory() to compute the probability that the lineages have not yet coalesced by time t, and using these to approximate \(E[T] = \int_t^\infty P(T > t) dt\), where \(T\) is the coalescence time. See coalescence_rate_trajectory() for more details.

To compute this, an adequate time discretisation must be arrived at by iteratively extending or refining the current discretisation. Debugging information about numerical convergence of this procedure is logged using the Python logging infrastructure. The daiquiri module is a convenient way to set up logging, and we can use it to make these messages appear on stderr like this:

import daiquiri

daiquiri.setup(level="DEBUG")
debugger.mean_coalescence_time(1)

Briefly, this outputs iteration number, mean coalescence time, maximum difference in probability of not having coalesced yet, difference to last coalescence time, probability of not having coalesced by the final time point, and whether the last iteration was an extension or refinement.

Parameters:
  • lineages (dict) – A mapping of populations (either integer IDs or string names: see the Identifying populations section for more details) to the number of monoploid sample lineages in that population.

  • min_pop_size (int) – See coalescence_rate_trajectory().

  • steps (list) – The time discretisation to start out with (by default, picks something based on epoch times).

  • rtol (float) – The relative tolerance to determine mean coalescence time to (used to decide when to stop subdividing the steps).

  • max_iter (int) – The maximum number of times to subdivide the steps.

Returns:

The mean coalescence time (a number).

Return type:

float

coalescence_rate_trajectory(steps, lineages, min_pop_size=1, double_step_validation=True)[source]#

Calculate the mean coalescence rates and proportions of uncoalesced lineages between pairs of the specified sample lineages, at each of the times ago listed by steps, in this demographic model. Sample lineages are specified as a mapping from populations to the number of monoploid sample genomes present in that population at time zero. See the Inverse instantaneous coalescence rates section for usage examples and more details.

The coalescence rate at time t in the past is the average rate of coalescence of as-yet-uncoalesced lineages, computed as follows: let \(p(t)\) be the probability that the lineages of a randomly chosen pair of samples has not yet coalesced by time \(t\), let \(p(z,t)\) be the probability that the lineages of a randomly chosen pair of samples has not yet coalesced by time \(t\) and are both in population \(z\), and let \(N(z,t)\) be the diploid effective population size of population \(z\) at time \(t\). Then the mean coalescence rate at time \(t\) is \(r(t) = (\sum_z p(z,t) / (2 * N(z,t))) / p(t)\).

The computation is done by approximating population size trajectories with piecewise constant trajectories between each of the steps. For this to be accurate, the distance between the steps must be small enough so that (a) short epochs (e.g., bottlenecks) are not missed, and (b) populations do not change in size too much over that time, if they are growing or shrinking. This function optionally provides a simple check of this approximation by recomputing the coalescence rates on a grid of steps twice as fine and throwing a warning if the resulting values do not match to a relative tolerance of 0.001.

Parameters:
  • steps (list) – The times ago at which coalescence rates will be computed.

  • lineages (dict) – A mapping of populations (either integer IDs or string names: see the Identifying populations section for more details) to the number of monoploid sample lineages in that population.

  • min_pop_size (int) – The smallest allowed population size during computation of coalescent rates (i.e., coalescence rates are actually 1 / (2 * max(min_pop_size, N(z,t))). Spurious very small population sizes can occur in models where populations grow exponentially but are unused before some time in the past, and lead to floating point error. This should be set to a value smaller than the smallest desired population size in the model.

  • double_step_validation (bool) – Whether to perform the check that step sizes are sufficiently small, as described above. This is highly recommended, and will take at most four times the computation.

Returns:

A tuple of arrays whose jth elements, respectively, are the coalescence rate at the jth time point (denoted r(t[j]) above), and the probability that a randomly chosen pair of lineages has not yet coalesced (denoted p(t[j]) above).

Return type:

(numpy.ndarray, numpy.ndarray)

Likelihoods#

msprime.log_arg_likelihood(ts, recombination_rate, Ne=1)[source]#

Returns the log probability of the stored tree sequence under the Hudson ARG. An exact expression for this probability is given in equation (1) of Kuhner et al. (2000).

We assume branch lengths stored in generations, resulting in a coalescence rate of \(1 / (2 N_e)\) per pair of lineages.

Warning

The stored tree sequence must store the full realisation of the ARG, including all recombination events and all common ancestor events, regardless of whether the recombinations cause a change in the ancestral tree or whether the common ancestor events cause coalescence of ancestral material. See Ancestral recombination graph for details of this data structure, and how to generate them using msprime.

Warning

This method only supports continuous genomes. See Discrete or continuous? for how these can be specified when simulating tree sequences using msprime.

Parameters:
  • ts (tskit.TreeSequence) – The tree sequence object.

  • recombination_rate (float) – The per-link, per-generation recombination probability. Must be non-negative.

  • Ne (float) – The diploid effective population size.

Returns:

The log probability of the tree sequence under the Hudson ancestral recombination graph model. If the recombination rate is zero and the tree sequence contains at least one recombination event, then returns -DBL_MAX.

msprime.log_mutation_likelihood(ts, mutation_rate)[source]#

Returns the unnormalised log probability of the stored pattern of mutations on the stored tree sequence, assuming infinite sites mutation. In particular, each stored site must only contain a single mutation, and all mutant alleles are treated interchangeably as having arisen from a single mutation rate. This is automatically true for some msprime mutation models, such as JC69. The omitted normalising constant depends on the pattern of mutations, but not on the tree sequence or the mutation rate.

Warning

The infinite sites assumption means that this method is only valid for continuous genomes. It can also be run on discrete genomes, for which it may provide a reasonable approximation when the number of sites is large and the mutation rate is low. See Discrete or continuous? for how continuous genomes can be specified when simulating tree sequences using msprime.

The function first computes the probability of the overall number of mutations \(M\) from the Poisson probability mass function

\[e^{-T \mu / 2} \frac{(T \mu / 2)^M}{M!},\]

where \(T\) is the total area of ancestral material in the tree sequence stored in units of generations, and \(\mu\) is the per-site, per-generation mutation probability. Each mutation then contributes an individual factor of \(l / T\), where \(l\) is the total branch length on which the mutation could have arisen while appearing on all of the required lineages, again stored in generations.

Warning

If a tree at the site of a mutation contains unary nodes, then \(l\) could span more than one edge. In particular, we do not constrain mutations to take place on the edge directly above the node on which they have been recorded, but rather on any edge which would yield the same configuration of SNPs at the leaves of the tree sequence.

Parameters:
  • ts (tskit.TreeSequence) – The tree sequence object with mutations.

  • mutation_rate (float) – The per-site, per-generation mutation probability. Must be non-negative.

Returns:

The unnormalised log probability of the observed SNPs given the tree sequence. If the mutation rate is set to zero and the tree sequence contains at least one mutation, then returns -float(“inf”).

Pedigrees#

class msprime.PedigreeBuilder(demography=None, individuals_metadata_schema=None)[source]#

Utility for building pedigrees in the format required for input to the FixedPedigree ancestry model.

See also

See the Pedigrees section for more information on how pedigrees are described and imported in msprime.

Example:

pb = msprime.PedigreeBuilder()
mom_id = pb.add_individual(time=1)
dad_id = pb.add_individual(time=1)
pb.add_individual(time=0, parents=[mom_id, dad_id], is_sample=True)
pedigree_tables = pb.finalise()
Parameters:
  • demography – The Demography defining populations referred to in the populations column, if specified. If None (the default) a demography consisting of one population is used (and only population 0 can be referred to).

  • individuals_metadata_schema (tskit.MetadataSchema) – If specified, set the metadata_schema for the individuals table in the final table collection. Must be an instance of tskit.MetadataSchema See the Metadata section for more information.

add_individual(*, time, is_sample=None, parents=None, population=None, metadata=None)[source]#

Adds an individual with the specified properties, returning its ID.

Parameters:
  • time (float) – The time for this individual measured in generations ago.

  • is_sample (bool) – If True, the new individual is marked as a sample; if False, the individual is not marked as a sample. If None (the default) the individual is marked as a sample if its time is zero. Parent IDs are not checked, and may refer to individuals not yet added to the tables.

  • parents (list(int)) – The integer IDs of the specified individual’s parents. Exactly two parents must be specified. If None (the default), the individual is treated as a founder by setting its parents to [-1, -1].

  • population (str|int) – The population to associated with this individual. The value can be a string or integer, following the usual population identifier rules. If None (the default), a population ID of 0 will be assigned if the Demography associated with this PedigreeBuilder has a single population (the default). If the demography has more than one population, then a population must be explicitly specified for each individual.

  • metadata (dict|bytes) – Any metadata to associate with the new individual. See the Metadata section for more information.

Returns:

The ID of the newly added individual.

Return type:

int

finalise(sequence_length=None)[source]#

Returns the tskit.TableCollection describing the pedigree defined by calls to add_individual().

The state of the pedigree builder is not modified by this method.

Parameters:

sequence_length (float) – If specified, set the sequence_length property of the returned TableCollection to this value. If None (the default) the sequence_length is -1.

Returns:

The TableCollection defining the pedigree.

Return type:

tskit.TableCollection

msprime.parse_pedigree(text_file, *, demography=None, sequence_length=None)[source]#

Parse a text file describing a pedigree used for input to the FixedPedigree ancestry model. See the Pedigree encoding section for more information on the data encoding used for pedigrees.

See also

See the Format definition section for a detailed description of the columns and formatting requirements for this file format.

The returned tskit.TableCollection will contain an individual and two nodes for each data row in the input file. The individual will have a metadata field file_id containing the value of the id column in the input. Individuals (and their corresponding nodes) are added to the tables in the order seen in the file. There is no ordering requirement for parents and children.

If a Demography instance is provided to the demography parameter, this is used to translate and validate population identifiers in the population column, and is also used to fill the population table in the output tskit.TableCollection. See the Pedigrees and demography section for more information on the interaction between demography and FixedPedigree simulations.

Parameters:
  • text_file – A file-like object to read from.

  • demography – The Demography defining populations referred to in the populations column, if specified. If None (the default) a demography consisting of one population is used (and only population 0 can be referred to).

  • sequence_length (float) – If specified, set the sequence_length property of the returned TableCollection to this value.

Returns:

The tskit.TableCollection object containing the corresponding pedigree data.

Return type:

tskit.TableCollection