Quickstart#

This page gives some simple examples of how to use the major features of msprime, with links to more detailed documentation and tutorial content.

See the Installation page for instructions on installing msprime (short version: pip install msprime or conda install -c conda-forge msprime will work for most users).

Ancestry#

Msprime simulates ancestral histories for a set of sample genomes using backwards-in-time population genetic models. Here we run a simple simulation of a short recombining sequence under human-like parameters:

    import msprime
    from IPython.display import SVG

    # Simulate an ancestral history for 3 diploid samples under the coalescent
    # with recombination on a 5kb region with human-like parameters.
    ts = msprime.sim_ancestry(
        samples=3,
        recombination_rate=1e-8,
        sequence_length=5_000,
        population_size=10_000,
        random_seed=123456,
    )
    # Visualise the simulated ancestral history.
    SVG(ts.draw_svg())

[Figure: SVG drawing of the simulated tree sequence]

In this example we simulate the ancestral history of three diploid individuals (see Specifying samples and Ploidy) for a 5kb sequence with a recombination rate of \(10^{-8}\) per base pair per generation (see Genome properties). The samples are drawn from a population with a constant size of 10,000 (see the Demography section below), and the simulation uses the default coalescent ancestry model (see the Models section for details on other available models). To ensure that the output of this example is predictable, we set a random seed (see Random seeds).

When recombination is present, the ancestry of a sample of DNA sequences cannot be represented by a single genealogical tree relating the samples to their genetic ancestors; there is instead a sequence of highly correlated trees along the genome. The result of our simulation is therefore a tree sequence object from the tskit library, which provides a rich suite of operations for analysing these genealogical histories: see the Getting started with tskit tutorial for help. In this example we show a visualisation of the four different trees along the 5kb region (see the Visualization tutorial for more examples). We specified three diploid sample individuals, and each diploid individual has two monoploid genomes, so each of these trees has 6 “sample” nodes (the “leaves” or “tips”); see Specifying samples.

See the Ancestry simulations section for more details on ancestry simulations.

Mutations#

The sim_ancestry() function generates a simulated ancestral history for some samples. If we want genome sequences, we must also simulate some mutations on these trees. Note, however, that simulating mutations is not always necessary in order to use the simulations, and it is often better not to; see the Do you really need mutations? tutorial for more information.

Given an input tree sequence (which may be generated by msprime or any other simulator that supports tskit output), we can superimpose mutations on that ancestral history using the sim_mutations() function under a number of different models of sequence evolution. For example, here we generate some mutations for the tree sequence simulated in the previous section under the Jukes-Cantor model:

mutated_ts = msprime.sim_mutations(ts, rate=1e-8, random_seed=54321)
SVG(mutated_ts.draw_svg())

[Figure: SVG drawing of the tree sequence with mutations]

This visualisation shows us where the mutations occurred both in terms of position along the genome (the tick marks with red chevrons on the x-axis) and the branches of trees that they occurred on (the red crosses). This information is stored in the tskit site and mutation tables:

mutated_ts.tables.sites

id	position	ancestral_state
0	90	T
1	333	G
2	819	T
3	3,204	A

mutated_ts.tables.mutations

id	site	node	time	derived_state	parent
0	0	10	12,191.40649541	G	-1
1	1	10	44,173.26473294	C	-1
2	2	9	9,597.21952351	G	-1
3	3	3	1,158.42914901	C	-1

The combination of sites and mutations on a given ancestry then defines the variants, which we can access using the variants() method:

for variant in mutated_ts.variants():
    print(variant)

╔═══════════════════════════════╗
║Variant                        ║
╠═══════════════════════╤═══════╣
║Site id                │      0║
╟───────────────────────┼───────╢
║Site position          │   90.0║
╟───────────────────────┼───────╢
║Number of samples      │      6║
╟───────────────────────┼───────╢
║Number of alleles      │      2║
╟───────────────────────┼───────╢
║Samples with allele 'T'│1 (17%)║
╟───────────────────────┼───────╢
║Samples with allele 'G'│5 (83%)║
╟───────────────────────┼───────╢
║Has missing data       │  False║
╟───────────────────────┼───────╢
║Isolated as missing    │   True║
╚═══════════════════════╧═══════╝

╔═══════════════════════════════╗
║Variant                        ║
╠═══════════════════════╤═══════╣
║Site id                │      1║
╟───────────────────────┼───────╢
║Site position          │  333.0║
╟───────────────────────┼───────╢
║Number of samples      │      6║
╟───────────────────────┼───────╢
║Number of alleles      │      2║
╟───────────────────────┼───────╢
║Samples with allele 'G'│1 (17%)║
╟───────────────────────┼───────╢
║Samples with allele 'C'│5 (83%)║
╟───────────────────────┼───────╢
║Has missing data       │  False║
╟───────────────────────┼───────╢
║Isolated as missing    │   True║
╚═══════════════════════╧═══════╝

╔═══════════════════════════════╗
║Variant                        ║
╠═══════════════════════╤═══════╣
║Site id                │      2║
╟───────────────────────┼───────╢
║Site position          │  819.0║
╟───────────────────────┼───────╢
║Number of samples      │      6║
╟───────────────────────┼───────╢
║Number of alleles      │      2║
╟───────────────────────┼───────╢
║Samples with allele 'T'│3 (50%)║
╟───────────────────────┼───────╢
║Samples with allele 'G'│3 (50%)║
╟───────────────────────┼───────╢
║Has missing data       │  False║
╟───────────────────────┼───────╢
║Isolated as missing    │   True║
╚═══════════════════════╧═══════╝

╔═══════════════════════════════╗
║Variant                        ║
╠═══════════════════════╤═══════╣
║Site id                │      3║
╟───────────────────────┼───────╢
║Site position          │3,204.0║
╟───────────────────────┼───────╢
║Number of samples      │      6║
╟───────────────────────┼───────╢
║Number of alleles      │      2║
╟───────────────────────┼───────╢
║Samples with allele 'A'│5 (83%)║
╟───────────────────────┼───────╢
║Samples with allele 'C'│1 (17%)║
╟───────────────────────┼───────╢
║Has missing data       │  False║
╟───────────────────────┼───────╢
║Isolated as missing    │   True║
╚═══════════════════════╧═══════╝

Demography#

By default, ancestry simulations assume an extremely simple population structure in which a single randomly mating population of a fixed size exists for all time. For most simulations this is an unrealistic assumption, so msprime provides a way to describe more complex demographic models using the demography API.

For example, here we define a simple three-population model in which populations “A” and “B” split from “C” 500 generations ago:

demography = msprime.Demography()
demography.add_population(name="A", initial_size=10_000)
demography.add_population(name="B", initial_size=5_000)
demography.add_population(name="C", initial_size=1_000)
demography.add_population_split(time=500, derived=["A", "B"], ancestral="C")
demography

Populations (3)

id	name	initial_size	default_sampling_time	extra_metadata
0	A	10000.0	0	{}
1	B	5000.0	0	{}
2	C	1000.0	5e+02	{}

Migration matrix (all zero)

Events (1)

time	type	parameters	effect
500	Population Split	derived=[A, B], ancestral=C	Moves all lineages from derived populations 'A' and 'B' to the ancestral 'C' population. Also set the derived populations to inactive, and all migration rates to and from the derived populations to zero.

The demography API provides debugging tools to help understand and visualise the demographic models we define, as well as some numerical methods to provide analytical predictions about these models.

We can then simulate ancestral histories conditioned on these models using sim_ancestry(). For example, here we simulate 5 diploid sample individuals from each of populations “A” and “B”:

ts = msprime.sim_ancestry({"A": 5, "B": 5}, demography=demography, random_seed=123)
ts

Tree Sequence
Trees	1
Sequence Length	1.0
Time Units	generations
Sample Nodes	20
Total Size	5.1 KiB
Metadata	No Metadata

Table	Rows	Size	Has Metadata
Edges	38	1.2 KiB
Individuals	10	304 Bytes
Migrations	0	8 Bytes
Mutations	0	16 Bytes
Nodes	39	1.1 KiB
Populations	3	294 Bytes	✅
Provenances	1	1.9 KiB
Sites	0	16 Bytes

Provenance Timestamp	Software Name	Version	Command
25 July, 2025 at 03:34:08 PM	msprime	undefined	sim_ancestry

See the Demographic models section for more details on defining demographic models, and the Specifying samples section for more details on how to specify samples under these models in ancestry simulations.