Quickstart#
This page gives some simple examples of how to use the major features of msprime, with links to more detailed documentation and tutorial content.
See the Installation page for instructions on installing
msprime (short version: pip install msprime
or
conda install -c conda-forge msprime
will work for most users).
Ancestry#
Msprime simulates ancestral histories for a set of sample genomes using backwards-in-time population genetic models. Here we run a simple simulation of a short recombining sequence under human-like parameters:
import msprime
from IPython.display import SVG, display
# Simulate an ancestral history for 3 diploid samples under the coalescent
# with recombination on a 5kb region with human-like parameters.
ts = msprime.sim_ancestry(
samples=3,
recombination_rate=1e-8,
sequence_length=5_000,
population_size=10_000,
random_seed=123456)
# Visualise the simulated ancestral history.
SVG(ts.draw_svg())
In this example we simulate the ancestral history of three diploid individuals (see Specifying samples and Ploidy) for a 5kb sequence with a recombination rate of \(10^{-8}\) (see Genome properties) from a population with a constant size of 10,000 (see the Demography section below) under the default coalescent ancestry model (see the Models for details on other available models). To ensure that the output of this example is predictable, we set a random seed (see Random seeds).
When recombination is present, the ancestry of a sample of DNA sequences cannot be represented by a single genealogical tree relating the samples to their genetic ancestors; there is instead a sequence of highly correlated trees along the genome. The result of our simulation is therefore a tree sequence object from the tskit library, which provides a rich suite of operations for analysing these genealogical histories: see the Getting started with tskit tutorial for help. In this example we show a visualisation of the four different trees along the 5kb region (see the Visualization tutorial for more examples). Because we have specified three diploid sample individuals, each of these trees has 6 “sample” nodes (the “leaves” or “tips”), because each diploid individual has two monoploid genomes (see Specifying samples).
See the Ancestry simulations section for more details on ancestry simulations.
Mutations#
The sim_ancestry()
function generates a simulated ancestral
history for some samples.
If we want genome sequence
we must also simulate some
mutations on these trees.
However, it’s important to note that it’s not always necessary to
simulate mutations in order to use the simulations; often, it’s
better not to; see the Do you really need mutations?
tutorial for more information.
Given an input tree sequence (which may be generated by msprime
or any other simulator that supports tskit
output), we can superimpose mutations on that ancestral history
using the sim_mutations()
function under a number
of different models of sequence
evolution. For example,
here we generate some mutations for the tree sequence simulated
in the previous section under the Jukes-Cantor
model:
mutated_ts = msprime.sim_mutations(ts, rate=1e-8, random_seed=54321)
SVG(mutated_ts.draw_svg())
This visualisation shows us where the mutations occurred both in terms of position along the genome (the tick marks with red chevrons on the x-axis) and the branches of trees that they occurred on (the red crosses). This information is stored in the tskit site and mutation tables:
mutated_ts.tables.sites
id | position | ancestral_state | metadata |
---|---|---|---|
0 | 90 | T | |
1 | 333 | G | |
2 | 819 | T | |
3 | 3204 | A |
mutated_ts.tables.mutations
id | site | node | time | derived_state | parent | metadata |
---|---|---|---|---|---|---|
0 | 0 | 10 | 12191.40649541 | G | -1 | |
1 | 1 | 10 | 44173.26473294 | C | -1 | |
2 | 2 | 9 | 9597.21952351 | G | -1 | |
3 | 3 | 3 | 1158.42914901 | C | -1 |
The combination of sites and mutations on a given ancestry then
defines the variants, which we can access using the
variants()
method:
for variant in mutated_ts.variants():
print(variant)
╔═══════════════════════════════╗
║Variant ║
╠═══════════════════════╤═══════╣
║Site id │ 0║
╟───────────────────────┼───────╢
║Site position │ 90.0║
╟───────────────────────┼───────╢
║Number of samples │ 6║
╟───────────────────────┼───────╢
║Number of alleles │ 2║
╟───────────────────────┼───────╢
║Samples with allele 'T'│1 (17%)║
╟───────────────────────┼───────╢
║Samples with allele 'G'│5 (83%)║
╟───────────────────────┼───────╢
║Has missing data │ False║
╟───────────────────────┼───────╢
║Isolated as missing │ True║
╚═══════════════════════╧═══════╝
╔═══════════════════════════════╗
║Variant ║
╠═══════════════════════╤═══════╣
║Site id │ 1║
╟───────────────────────┼───────╢
║Site position │ 333.0║
╟───────────────────────┼───────╢
║Number of samples │ 6║
╟───────────────────────┼───────╢
║Number of alleles │ 2║
╟───────────────────────┼───────╢
║Samples with allele 'G'│1 (17%)║
╟───────────────────────┼───────╢
║Samples with allele 'C'│5 (83%)║
╟───────────────────────┼───────╢
║Has missing data │ False║
╟───────────────────────┼───────╢
║Isolated as missing │ True║
╚═══════════════════════╧═══════╝
╔═══════════════════════════════╗
║Variant ║
╠═══════════════════════╤═══════╣
║Site id │ 2║
╟───────────────────────┼───────╢
║Site position │ 819.0║
╟───────────────────────┼───────╢
║Number of samples │ 6║
╟───────────────────────┼───────╢
║Number of alleles │ 2║
╟───────────────────────┼───────╢
║Samples with allele 'T'│3 (50%)║
╟───────────────────────┼───────╢
║Samples with allele 'G'│3 (50%)║
╟───────────────────────┼───────╢
║Has missing data │ False║
╟───────────────────────┼───────╢
║Isolated as missing │ True║
╚═══════════════════════╧═══════╝
╔═══════════════════════════════╗
║Variant ║
╠═══════════════════════╤═══════╣
║Site id │ 3║
╟───────────────────────┼───────╢
║Site position │ 3204.0║
╟───────────────────────┼───────╢
║Number of samples │ 6║
╟───────────────────────┼───────╢
║Number of alleles │ 2║
╟───────────────────────┼───────╢
║Samples with allele 'A'│5 (83%)║
╟───────────────────────┼───────╢
║Samples with allele 'C'│1 (17%)║
╟───────────────────────┼───────╢
║Has missing data │ False║
╟───────────────────────┼───────╢
║Isolated as missing │ True║
╚═══════════════════════╧═══════╝
See also
See Getting started with tskit tutorial for more information on how to use tree sequences to efficiently store and analyse data.
Demography#
By default ancestry simulations assume an extremely simple population structure in which a single randomly mating population of a fixed size exists for all time. For most simulations this is an unrealistic assumption, and so msprime provides a way to describe more complex demographic models using the demography API.
For example, here we define a simple three population model in which populations “A” and “B” split from “C” 500 generations ago:
demography = msprime.Demography()
demography.add_population(name="A", initial_size=10_000)
demography.add_population(name="B", initial_size=5_000)
demography.add_population(name="C", initial_size=1_000)
demography.add_population_split(time=500, derived=["A", "B"], ancestral="C")
demography
id | name | description | initial_size | growth_rate | default_sampling_time | extra_metadata |
---|---|---|---|---|---|---|
0 | A | 10000.0 | 0 | 0 | {} | |
1 | B | 5000.0 | 0 | 0 | {} | |
2 | C | 1000.0 | 0 | 5e+02 | {} |
time | type | parameters | effect |
---|---|---|---|
500 | Population Split | derived=[A, B], ancestral=C | Moves all lineages from derived populations 'A' and 'B' to the ancestral 'C' population. Also set the derived populations to inactive, and all migration rates to and from the derived populations to zero. |
The demography API provides debugging tools to help understand and visualise the demographic models we define, as well as some numerical methods to provide analytical predictions about these models.
We can then simulate ancestral histories conditioned on these
models using sim_ancestry()
. For example, here we simulate
5 diploid sample individuals from populations “A” and “B”:
ts = msprime.sim_ancestry({"A": 5, "B": 5}, demography=demography, random_seed=123)
ts
Tree Sequence | |
---|---|
Trees | 1 |
Sequence Length | 1.0 |
Time Units | generations |
Sample Nodes | 20 |
Total Size | 4.9 KiB |
Metadata | No Metadata |
Table | Rows | Size | Has Metadata |
---|---|---|---|
Edges | 38 | 1.2 KiB | |
Individuals | 10 | 304 Bytes | |
Migrations | 0 | 8 Bytes | |
Mutations | 0 | 16 Bytes | |
Nodes | 39 | 1.1 KiB | |
Populations | 3 | 294 Bytes | ✅ |
Provenances | 1 | 1.7 KiB | |
Sites | 0 | 16 Bytes |
See the Demographic models section for more details on defining demographic models, and the Specifying samples section for more details on how to specify samples under these models in ancestry simulations.