Quickstart

Quickstart#

This page gives some simple examples of how to use the major features of msprime, with links to more detailed documentation and tutorial content.

See the Installation page for instructions on installing msprime (short version: pip install msprime or conda install -c conda-forge msprime will work for most users).

Ancestry#

Msprime simulates ancestral histories for a set of sample genomes using backwards-in-time population genetic models. Here we run a simple simulation of a short recombining sequence under human-like parameters:

    import msprime
    from IPython.display import SVG, display

    # Simulate an ancestral history for 3 diploid samples under the coalescent
    # with recombination on a 5kb region with human-like parameters.
    ts = msprime.sim_ancestry(
        samples=3,
        recombination_rate=1e-8,
        sequence_length=5_000,
        population_size=10_000,
        random_seed=123456)
    # Visualise the simulated ancestral history.
    SVG(ts.draw_svg())
_images/e51b576ab189175e36540f7e8a42ed8956c2247eb356156ba220426393b326d9.svg

In this example we simulate the ancestral history of three diploid individuals (see Specifying samples and Ploidy) for a 5kb sequence with a recombination rate of \(10^{-8}\) (see Genome properties) from a population with a constant size of 10,000 (see the Demography section below) under the default coalescent ancestry model (see the Models for details on other available models). To ensure that the output of this example is predictable, we set a random seed (see Random seeds).

When recombination is present, the ancestry of a sample of DNA sequences cannot be represented by a single genealogical tree relating the samples to their genetic ancestors; there is instead a sequence of highly correlated trees along the genome. The result of our simulation is therefore a tree sequence object from the tskit library, which provides a rich suite of operations for analysing these genealogical histories: see the Getting started with tskit tutorial for help. In this example we show a visualisation of the four different trees along the 5kb region (see the Visualization tutorial for more examples). Because we have specified three diploid sample individuals, each of these trees has 6 “sample” nodes (the “leaves” or “tips”), because each diploid individual has two monoploid genomes (see Specifying samples).

See the Ancestry simulations section for more details on ancestry simulations.

Mutations#

The sim_ancestry() function generates a simulated ancestral history for some samples. If we want genome sequence we must also simulate some mutations on these trees. However, it’s important to note that it’s not always necessary to simulate mutations in order to use the simulations; often, it’s better not to; see the Do you really need mutations? tutorial for more information.

Given an input tree sequence (which may be generated by msprime or any other simulator that supports tskit output), we can superimpose mutations on that ancestral history using the sim_mutations() function under a number of different models of sequence evolution. For example, here we generate some mutations for the tree sequence simulated in the previous section under the Jukes-Cantor model:

mutated_ts = msprime.sim_mutations(ts, rate=1e-8, random_seed=54321)
SVG(mutated_ts.draw_svg())
_images/cc188cb39c77f9618874eec9410d7edb602dbeed71018121af6a247e496d9403.svg

This visualisation shows us where the mutations occurred both in terms of position along the genome (the tick marks with red chevrons on the x-axis) and the branches of trees that they occurred on (the red crosses). This information is stored in the tskit site and mutation tables:

mutated_ts.tables.sites
idpositionancestral_statemetadata
090T
1333G
2819T
33204A
mutated_ts.tables.mutations
idsitenodetimederived_stateparentmetadata
001012191.40649541G-1
111044173.26473294C-1
2299597.21952351G-1
3331158.42914901C-1

The combination of sites and mutations on a given ancestry then defines the variants, which we can access using the variants() method:

for variant in mutated_ts.variants():
    print(variant)
╔═══════════════════════════════╗
║Variant                        ║
╠═══════════════════════╤═══════╣
║Site id                │      0║
╟───────────────────────┼───────╢
║Site position          │   90.0║
╟───────────────────────┼───────╢
║Number of samples      │      6║
╟───────────────────────┼───────╢
║Number of alleles      │      2║
╟───────────────────────┼───────╢
║Samples with allele 'T'│1 (17%)║
╟───────────────────────┼───────╢
║Samples with allele 'G'│5 (83%)║
╟───────────────────────┼───────╢
║Has missing data       │  False║
╟───────────────────────┼───────╢
║Isolated as missing    │   True║
╚═══════════════════════╧═══════╝

╔═══════════════════════════════╗
║Variant                        ║
╠═══════════════════════╤═══════╣
║Site id                │      1║
╟───────────────────────┼───────╢
║Site position          │  333.0║
╟───────────────────────┼───────╢
║Number of samples      │      6║
╟───────────────────────┼───────╢
║Number of alleles      │      2║
╟───────────────────────┼───────╢
║Samples with allele 'G'│1 (17%)║
╟───────────────────────┼───────╢
║Samples with allele 'C'│5 (83%)║
╟───────────────────────┼───────╢
║Has missing data       │  False║
╟───────────────────────┼───────╢
║Isolated as missing    │   True║
╚═══════════════════════╧═══════╝

╔═══════════════════════════════╗
║Variant                        ║
╠═══════════════════════╤═══════╣
║Site id                │      2║
╟───────────────────────┼───────╢
║Site position          │  819.0║
╟───────────────────────┼───────╢
║Number of samples      │      6║
╟───────────────────────┼───────╢
║Number of alleles      │      2║
╟───────────────────────┼───────╢
║Samples with allele 'T'│3 (50%)║
╟───────────────────────┼───────╢
║Samples with allele 'G'│3 (50%)║
╟───────────────────────┼───────╢
║Has missing data       │  False║
╟───────────────────────┼───────╢
║Isolated as missing    │   True║
╚═══════════════════════╧═══════╝

╔═══════════════════════════════╗
║Variant                        ║
╠═══════════════════════╤═══════╣
║Site id                │      3║
╟───────────────────────┼───────╢
║Site position          │ 3204.0║
╟───────────────────────┼───────╢
║Number of samples      │      6║
╟───────────────────────┼───────╢
║Number of alleles      │      2║
╟───────────────────────┼───────╢
║Samples with allele 'A'│5 (83%)║
╟───────────────────────┼───────╢
║Samples with allele 'C'│1 (17%)║
╟───────────────────────┼───────╢
║Has missing data       │  False║
╟───────────────────────┼───────╢
║Isolated as missing    │   True║
╚═══════════════════════╧═══════╝

See also

See Getting started with tskit tutorial for more information on how to use tree sequences to efficiently store and analyse data.

Demography#

By default ancestry simulations assume an extremely simple population structure in which a single randomly mating population of a fixed size exists for all time. For most simulations this is an unrealistic assumption, and so msprime provides a way to describe more complex demographic models using the demography API.

For example, here we define a simple three population model in which populations “A” and “B” split from “C” 500 generations ago:

demography = msprime.Demography()
demography.add_population(name="A", initial_size=10_000)
demography.add_population(name="B", initial_size=5_000)
demography.add_population(name="C", initial_size=1_000)
demography.add_population_split(time=500, derived=["A", "B"], ancestral="C")
demography
Populations (3)
idnamedescriptioninitial_sizegrowth_ratedefault_sampling_timeextra_metadata
0A10000.000{}
1B5000.000{}
2C1000.005e+02{}
Migration matrix (all zero)
Events (1)
timetypeparameterseffect
500Population Splitderived=[A, B], ancestral=CMoves all lineages from derived populations 'A' and 'B' to the ancestral 'C' population. Also set the derived populations to inactive, and all migration rates to and from the derived populations to zero.

The demography API provides debugging tools to help understand and visualise the demographic models we define, as well as some numerical methods to provide analytical predictions about these models.

We can then simulate ancestral histories conditioned on these models using sim_ancestry(). For example, here we simulate 5 diploid sample individuals from populations “A” and “B”:

ts = msprime.sim_ancestry({"A": 5, "B": 5}, demography=demography, random_seed=123)
ts
Tree Sequence
Trees1
Sequence Length1.0
Time Unitsgenerations
Sample Nodes20
Total Size4.9 KiB
MetadataNo Metadata
Table Rows Size Has Metadata
Edges 38 1.2 KiB
Individuals 10 304 Bytes
Migrations 0 8 Bytes
Mutations 0 16 Bytes
Nodes 39 1.1 KiB
Populations 3 294 Bytes
Provenances 1 1.7 KiB
Sites 0 16 Bytes

See the Demographic models section for more details on defining demographic models, and the Specifying samples section for more details on how to specify samples under these models in ancestry simulations.