{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"id": "b93b9525",
"metadata": {
"tags": [
"remove-cell"
]
},
"outputs": [],
"source": [
" import msprime\n",
" import tskit\n",
" from IPython.display import SVG\n",
" import numpy as np\n",
" # Doing this to make the notebook outputs deterministic.\n",
" # DO NOT DO THIS IN YOUR CODE\n",
" msprime.core.set_seed_rng_seed(42)"
]
},
{
"cell_type": "markdown",
"id": "a1238724",
"metadata": {},
"source": [
"(sec_mutations)=\n",
"\n",
"# Mutation simulations\n",
"\n",
"{ref}`Ancestry simulations` in msprime simulate a random\n",
"ancestral history of a set of sampled genomes. While the trees representing\n",
"this history are very useful (and often all we need, for many purposes),\n",
"they do not include any information about what the actual genome **sequences**\n",
"would look like for these samples. To produce genetic variation data\n",
"(as represented by a [VCF file](https://samtools.github.io/hts-specs/VCFv4.2.pdf),\n",
"for example), we need to simulate some extra information: we need\n",
"*mutations*. Msprime provides a powerful and efficient way of superimposing\n",
"neutral mutations from a range of sequence evolution\n",
"{ref}`models ` using the {func}`.sim_mutations`\n",
"function.\n",
"\n",
"For example, here we simulate the ancestry of two diploids (and therefore\n",
"four sample genomes) and store the resulting {class}`tskit.TreeSequence`\n",
"object in the ``ts`` variable. We then simulate some mutations on\n",
"that tree sequence, and show the resulting tree with mutations:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "0b04c6d7",
"metadata": {},
"outputs": [
{
"data": {
"image/svg+xml": [
""
],
"text/plain": [
""
]
},
"execution_count": 2,
"metadata": {
"filenames": {
"image/svg+xml": "/home/runner/work/tskit-site/tskit-site/docs/_build/jupyter_execute/mutations_2_0.svg"
}
},
"output_type": "execute_result"
}
],
"source": [
"ts = msprime.sim_ancestry(2, sequence_length=100, random_seed=1234)\n",
"mts = msprime.sim_mutations(ts, rate=0.01, random_seed=5678)\n",
"SVG(mts.draw_svg())"
]
},
{
"cell_type": "markdown",
"id": "f3a6a28c",
"metadata": {},
"source": [
"We can see the variation data produced by these mutations most simply\n",
"via the {meth}`tskit.TreeSequence.variants` method:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "529a3273",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"7.0\t('T', 'G')\t[0 1 1 0]\n",
"8.0\t('A', 'T')\t[0 1 1 0]\n",
"18.0\t('T', 'A')\t[0 1 1 0]\n",
"19.0\t('A', 'C')\t[0 1 1 0]\n",
"28.0\t('C', 'G')\t[0 1 1 0]\n",
"41.0\t('T', 'C')\t[1 0 0 1]\n",
"55.0\t('T', 'G')\t[0 0 0 1]\n",
"73.0\t('A', 'C')\t[0 1 1 0]\n",
"78.0\t('C', 'G')\t[1 0 0 1]\n",
"79.0\t('G', 'T')\t[1 0 0 1]\n"
]
}
],
"source": [
"for var in mts.variants():\n",
" print(var.site.position, var.alleles, var.genotypes, sep=\"\\t\")"
]
},
{
"cell_type": "markdown",
"id": "4883466a",
"metadata": {},
"source": [
"The [tskit](https://tskit.dev/tskit) library has many powerful\n",
"methods for working with mutation data --- see the\n",
"{ref}`tutorials:sec_tskit_getting_started` tutorial\n",
"for more information.\n",
"\n",
"---\n",
"\n",
"## Quick reference\n",
"\n",
"\n",
"{func}`.sim_mutations`\n",
": Add simulated mutations to a tree sequence\n",
"\n",
"**Models**\n",
"\n",
"{class}`.JC69` (Nucleotides)\n",
": Jukes & Cantor model ('69), equal probability of transitions between nucleotides\n",
"\n",
"{class}`.HKY` (Nucleotides)\n",
": Hasegawa, Kishino & Yano model ('85), different probabilities for transitions and transversions\n",
"\n",
"{class}`.F84` (Nucleotides)\n",
": Felsenstein model ('84), different probabilities for transitions and transversions\n",
"\n",
"{class}`.GTR` (Nucleotides)\n",
": Generalised Time-Reversible nucleotide mutation model\n",
"\n",
"{class}`.BLOSUM62` (Amino acids)\n",
": The BLOSUM62 model of time-reversible amino acid mutation\n",
"\n",
"{class}`.PAM` (Amino acids)\n",
": The PAM model of time-reversible amino acid mutation\n",
"\n",
"{class}`.BinaryMutationModel` (Binary ancestral/derived)\n",
": Binary mutation model with two flip-flopping alleles: \"0\" and \"1\".\n",
"\n",
"{class}`.MatrixMutationModel` (General finite state model)\n",
": Superclass of mutation models with a finite set of states\n",
"\n",
"{class}`.InfiniteAlleles` (Integers)\n",
": A generic infinite-alleles mutation model\n",
"\n",
"{class}`.SLiMMutationModel` (Integers)\n",
": An infinite-alleles model producing SLiM-style mutations\n",
"\n",
"---\n",
"\n",
"(sec_mutations_rate)=\n",
"\n",
"## Specifying rates\n",
"\n",
"The `rate` parameter of {func}`.sim_mutations` determines the rate of mutations per unit\n",
"of sequence length per generation. A rate must be specified for mutations to be generated."
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "bcae0c73",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"7"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ts = msprime.sim_ancestry(2, random_seed=1)\n",
"mts = msprime.sim_mutations(ts, rate=1, random_seed=1)\n",
"mts.num_mutations"
]
},
{
"cell_type": "markdown",
"id": "fa5d267a",
"metadata": {},
"source": [
"Running {func}`.sim_mutations` on this simulated tree of length 1 with `rate=1`\n",
"produces 7 mutations.\n",
"\n",
"Specifying higher rates lead to proportionately more mutations:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "2c84633a",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"32"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ts = msprime.sim_ancestry(2, random_seed=1)\n",
"mts = msprime.sim_mutations(ts, rate=5, random_seed=1)\n",
"mts.num_mutations"
]
},
{
"cell_type": "markdown",
"id": "72f43636",
"metadata": {},
"source": [
"It's also possible to provide a {class}`.RateMap` which specifies variable\n",
"mutation rates over different stretches of the sequence.\n",
"\n",
"In the following example, the mutation rate between positions 0 and 2 is 0.5,\n",
"between positions 2 and 5 is 0, and between positions 5 and 10 is 0.1:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "1d32542d",
"metadata": {},
"outputs": [
{
"data": {
"image/svg+xml": [
""
],
"text/plain": [
""
]
},
"execution_count": 6,
"metadata": {
"filenames": {
"image/svg+xml": "/home/runner/work/tskit-site/tskit-site/docs/_build/jupyter_execute/mutations_10_0.svg"
}
},
"output_type": "execute_result"
}
],
"source": [
"ts = msprime.sim_ancestry(4, sequence_length=7, random_seed=1)\n",
"ratemap = msprime.RateMap(position=[0, 2, 5, 7], rate=[0.5, 0, 0.1])\n",
"mts = msprime.sim_mutations(ts, rate=ratemap, random_seed=10)\n",
"SVG(mts.draw_svg(node_labels={}, size=(400, 300)))"
]
},
{
"cell_type": "markdown",
"id": "02cd0d15",
"metadata": {},
"source": [
"As we can see from the output, there are three mutations at site 0 (position 0),\n",
"three mutations at site 1 (position 1), no mutations between positions 2 and 5\n",
"(where the mutation rate is zero) and one mutation at site 2 (position 5).\n",
"This illustrates that as the rate increases, the probability of recurrent mutation\n",
"(multiple mutations at a site) also increases.\n",
"\n",
":::{note}\n",
"\n",
"If using a {class}`.RateMap` with {func}`.sim_mutations`, be sure that the final\n",
"position of the {class}`.RateMap` is the same as the\n",
"`sequence_length` of the tree sequence you're adding mutations to!\n",
"\n",
":::\n",
"\n",
"\n",
"\n",
"(sec_mutations_discrete)=\n",
"\n",
"## Discrete or continuous\n",
"\n",
"As in {func}`.sim_ancestry` (see {ref}`sec_ancestry_discrete_genome`),\n",
"the `discrete_genome` parameter controls whether mutations should be placed at\n",
"discrete, integer coordinates or continuously at floating point positions.\n",
"`discrete_genome=True` is the default:"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "23c53fe6",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([0., 1., 2., 4.])"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ts = msprime.sim_ancestry(4, sequence_length=5, random_seed=1)\n",
"mts = msprime.sim_mutations(ts, rate=0.1, random_seed=5)\n",
"mts.tables.sites.position"
]
},
{
"cell_type": "markdown",
"id": "bcd39d86",
"metadata": {},
"source": [
"Specifying `discrete_genome=False` places mutations at floating point positions\n",
"continuously along the genome."
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "422946fb",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([0.44910517, 1.03359577, 1.37043233, 1.98368303, 2.07117512,\n",
" 2.20654612, 4.04140843])"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"mts = msprime.sim_mutations(ts, rate=0.1, random_seed=5, discrete_genome=False)\n",
"mts.tables.sites.position"
]
},
{
"cell_type": "markdown",
"id": "5a6010ba",
"metadata": {},
"source": [
":::{note}\n",
"\n",
"Using `discrete_genome=False` means that the mutation model will conform\n",
"to the classic _infinite sites_ assumption, where each mutation in the simulation\n",
"occurs at a new site.\n",
"\n",
":::\n",
"\n",
"(sec_mutations_time_span)=\n",
"\n",
"## Restricting time span\n",
"\n",
"The time range where mutations can be placed is specified using the `start_time`\n",
"and `end_time` parameters. For instance, the following only allows mutations to occur\n",
"earlier than time ``1`` in the tree:"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "8a21815d",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([2.43665171, 2.36231214, 1.89044463, 1.75278168, 1.5317962 ,\n",
" 1.29553628, 1.18207318, 1.14345689])"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ts = msprime.sim_ancestry(2, random_seed=1)\n",
"mts = msprime.sim_mutations(ts, rate=2, start_time=1, random_seed=1)\n",
"mts.tables.mutations.time"
]
},
{
"cell_type": "markdown",
"id": "015732c6",
"metadata": {},
"source": [
"Note, however, that the child node of the edge where the mutation occurred can be younger\n",
"than `start_time`."
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "72186ee9",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([0. , 1.9775205 , 0. , 0. , 0. ,\n",
" 0. , 0.17986861, 0. ])"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"mts.tables.nodes.time[mts.tables.mutations.node]"
]
},
{
"cell_type": "markdown",
"id": "b155b224",
"metadata": {},
"source": [
"It is also possible to use multiple calls of {func}`.sim_mutations` with `start_time`\n",
"and `end_time` specified to simulate differing mutation rates at different\n",
"periods of time. For instance, the following code simulates mutations at a low rate\n",
"prior to time ``1`` and at a higher rate more recently."
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "fbc990cc",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[1.29553628 1.15674901]\n",
"[1.29553628 1.15674901 0.96826157 0.90004028 0.80074457 0.72974006\n",
" 0.69232262 0.53881673 0.41919452 0.39676747 0.31342418 0.20233004\n",
" 0.19810148 0.05438014 0.02639678 0.01660881]\n"
]
}
],
"source": [
"ts = msprime.sim_ancestry(2, random_seed=1)\n",
"mts = msprime.sim_mutations(ts, rate=0.1, start_time=1, random_seed=1)\n",
"print(mts.tables.mutations.time)\n",
"mts = msprime.sim_mutations(mts, rate=4, end_time=1, random_seed=1)\n",
"print(mts.tables.mutations.time)"
]
},
{
"cell_type": "markdown",
"id": "7119a097",
"metadata": {},
"source": [
"As explained in {ref}`the following section `, reversing the order of these two\n",
"lines will result in an error, as older mutations must be added first.\n",
"\n",
"\n",
"(sec_mutations_silent)=\n",
"\n",
"## Silent mutations\n",
"\n",
"Some of these mutation models produce *silent mutations*,\n",
"that do not change the allelic state.\n",
"For instance, here are all the mutations in a small simulation\n",
"using the {class}`.HKY`:"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "4358ff36",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\t A\tC\tG\tT\n",
"A\t 0\t89\t90\t74\n",
"C\t 102\t36\t148\t63\n",
"G\t 71\t165\t39\t94\n",
"T\t 60\t62\t89\t0\n"
]
}
],
"source": [
"def count_transitions(ts, alleles):\n",
" counts = np.zeros((len(alleles), len(alleles)), dtype='int')\n",
" for s in ts.sites():\n",
" aa = s.ancestral_state\n",
" for m in s.mutations:\n",
" pa = aa\n",
" da = m.derived_state\n",
" if m.parent != tskit.NULL:\n",
" pa = ts.mutation(m.parent).derived_state\n",
" counts[alleles.index(pa), alleles.index(da)] += 1\n",
" print(\"\\t\", \"\\t\".join(alleles))\n",
" for j, a in enumerate(alleles):\n",
" print(f\"{a}\\t\", \"\\t\".join(map(str, counts[j])))\n",
"\n",
"model = msprime.HKY(kappa=0.75, equilibrium_frequencies=[0.2, 0.3, 0.3, 0.2])\n",
"ts = msprime.sim_ancestry(\n",
" 5, random_seed=1, sequence_length=1e7, recombination_rate=1e-8,\n",
" population_size=1000)\n",
"mts = msprime.sim_mutations(ts, rate=1e-8, model=model, random_seed=1)\n",
"count_transitions(mts, model.alleles)"
]
},
{
"cell_type": "markdown",
"id": "7f1e3d0f",
"metadata": {},
"source": [
"There are a moderate but small number of C->C and G->G mutations.\n",
"This is because under the HKY model with these parameters,\n",
"the mutation rates for C and G is less than that of A and T.\n",
"However, msprime puts down mutations on the trees at *constant* rate;\n",
"and so to make the mutation process correct, some of these \"mutations\"\n",
"are chosen to be silent.\n",
"See {ref}`sec_mutations_matrix_mutation_theory` for more details on how this works.\n",
"\n",
"Such silent mutations might be surprising, but they are harmless:\n",
"they do not affect summary statistics\n",
"(e.g., {meth}`tskit.TreeSequence.diversity`)\n",
"or genotypes in any way.\n",
"In fact, leaving them in makes msprime's mutation process *more* robust:\n",
"see {ref}`sec_mutations_state_independence` for details.\n",
"\n",
"To demonstrate insensitivity to these silent mutations,\n",
"let's compute the number of segregating sites for the mutated tree sequence above.\n",
"Here's how many sites it has mutations at:"
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "58ae0640",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The tree sequence has 1182 sites with mutations.\n"
]
}
],
"source": [
"print(f\"The tree sequence has {mts.num_sites} sites with mutations.\")"
]
},
{
"cell_type": "markdown",
"id": "4d83a6e2",
"metadata": {},
"source": [
"So, we might expect the total number of segregating sites to be 1182.\n",
"However, some are silent, and subtracting the number of these gets the same answer as\n",
"{meth}`tskit.TreeSequence.segregating_sites`:"
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "c1b20008",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The tree sequence has 75 silent mutations,\n",
"and 1107.0 segregating sites.\n"
]
}
],
"source": [
"num_silent = 0\n",
"for s in mts.sites():\n",
" for m in s.mutations:\n",
" pa = s.ancestral_state\n",
" if m.parent != tskit.NULL:\n",
" pa = mts.mutation(m.parent).derived_state\n",
" if pa == m.derived_state:\n",
" num_silent += 1\n",
"\n",
"segsites = mts.segregating_sites(span_normalise=False)\n",
"\n",
"print(f\"The tree sequence has {num_silent} silent mutations,\\n\"\n",
" f\"and {segsites} segregating sites.\")"
]
},
{
"cell_type": "markdown",
"id": "e7da8771",
"metadata": {},
"source": [
"Indeed, 1182 - 75 = 1107.\n",
"\n",
"\n",
"(sec_mutations_existing)=\n",
"\n",
"## Existing mutations\n",
"\n",
"When you add mutations to a tree sequence which already contains them, the `keep` parameter\n",
"controls whether existing mutations are kept or discarded (the default is `keep=True`).\n",
"For instance, in final code block in {ref}`sec_mutations_time_span`, mutations were\n",
"progressively added to a simulated tree sequence, beginning with the oldest time period.\n",
"This could also be used to add mutations with different mutation models to different segments\n",
"of the genome.\n",
"\n",
"While it is more natural to add younger mutations to a tree sequence which already contains\n",
"older mutations, you can also add mutations ancestral to existing mutations,\n",
"as in the following code block:"
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "194946d9",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Before: 4 mutations.\n",
"After: 6 mutations.\n"
]
}
],
"source": [
"ts = msprime.sim_ancestry(5, random_seed=1)\n",
"mts = msprime.sim_mutations(ts, rate=1, random_seed=1)\n",
"mmts = msprime.sim_mutations(mts, rate=0.1, random_seed=5)\n",
"print(f\"Before: {mts.num_mutations} mutations.\\n\"\n",
" f\"After: {mmts.num_mutations} mutations.\")"
]
},
{
"cell_type": "markdown",
"id": "1528782f",
"metadata": {},
"source": [
"This might be done, for instance, if a tree sequence is the result of a forwards\n",
"simulation that has non-neutral variation, and you'd like to add neutral mutations.\n",
"\n",
"(sec_mutations_state_independence)=\n",
"\n",
"### Silent mutations and state-independence\n",
"\n",
"Although it is possible to add new mutations ancestrally to\n",
"existing mutations, doing so can produce confusing situations.\n",
"For instance, new mutations added to an already mutated site\n",
"will not assign a new ancestral state: for instance, with the\n",
"{class}`.PAM` of amino acid mutation,\n",
"if a site has ancestral state ``V`` and a mutation with derived state ``I``,\n",
"and a new mutation is added ancestral to the existing one,\n",
"the derived state of the new mutation will be randomly assigned\n",
"(rather than being ``V`` as you might expect).\n",
"This can lead to unlikely or even impossible situations:\n",
"the new mutation in this example might be a ``G``,\n",
"since mutations from ``V`` to ``G`` and ``V`` to ``I`` are both reasonably likely.\n",
"However, a mutation from ``G`` to ``I`` is impossible under this model,\n",
"and we now have the chain V->G->I\n",
"(the chain is possible under the {class}`.BLOSUM62`, but unlikely).\n",
"In practice this is more confusing than concerning,\n",
"since it can only happen when the same site is mutated more than once,\n",
"which affects only a small proportion of sites.\n",
"\n",
"The situation above occurred because the probability of mutating to ``I``\n",
"was very different depending on the initial state.\n",
"However, if derived alleles are chosen in a way *independent* of the previous state,\n",
"then adding mutations at rate {math}`\\alpha`\n",
"and then adding mutations at rate {math}`\\beta`\n",
"is *equivalent* to adding mutations once, at rate {math}`\\alpha + \\beta`.\n",
"A mutation model has this property if rows of its transition matrix are the same,\n",
"and is called *state independent* (or, *parent independent*).\n",
"For instance, mutations under the {class}`.JC69`\n",
"choose *any* derived allele with equal probability.\n",
"This means that the *result* of a mutation can be chosen without knowing\n",
"what the previous state at that site was (\"state independence\"),\n",
"which implies that adding new mutations above a given mutation does not affect\n",
"the probability of that mutation occurring.\n",
"However, this is only true if \"any derived allele\" includes the previous allele,\n",
"i.e., if we include {ref}`silent mutations `.\n",
"The default Jukes-Cantor model will mutate an ``A`` to either a ``C``, ``G``, or ``T``\n",
"with equal probability, which we can check by looking at the *transition matrix*,\n",
"whose rows give the probability that each allele mutates into each other allele\n",
"(see {ref}`sec_mutations_matrix_mutations_models` for more):"
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "78237653",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[0. , 0.33333333, 0.33333333, 0.33333333],\n",
" [0.33333333, 0. , 0.33333333, 0.33333333],\n",
" [0.33333333, 0.33333333, 0. , 0.33333333],\n",
" [0.33333333, 0.33333333, 0.33333333, 0. ]])"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"jc69 = msprime.JC69()\n",
"jc69.transition_matrix"
]
},
{
"cell_type": "markdown",
"id": "ddfa69d8",
"metadata": {},
"source": [
"We can check that this does not produce any silent mutations:"
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "a2c2dd3b",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\t A\tC\tG\tT\n",
"A\t 0\t1133\t1137\t1175\n",
"C\t 1116\t0\t1090\t1087\n",
"G\t 1138\t1121\t0\t1172"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"T\t 1142\t1133\t1131\t0\n"
]
}
],
"source": [
"rate = 1.5e-3\n",
"ts = msprime.sim_ancestry(5, random_seed=5, sequence_length=1e6)\n",
"mts = msprime.sim_mutations(ts, rate=rate, model=jc69, random_seed=7)\n",
"count_transitions(mts, jc69.alleles)"
]
},
{
"cell_type": "markdown",
"id": "0990fab1",
"metadata": {},
"source": [
"Although this model does not include silent mutations,\n",
"applying it more than once could produce some -\n",
"for instance, if one mutation with derived state ``C`` is inserted\n",
"above an existing mutation with the same derived state.\n",
"To make it state-independent, we have to choose the derived state\n",
"uniformly from *all* nucleotides:"
]
},
{
"cell_type": "code",
"execution_count": 18,
"id": "43e3921b",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[0.25, 0.25, 0.25, 0.25],\n",
" [0.25, 0.25, 0.25, 0.25],\n",
" [0.25, 0.25, 0.25, 0.25],\n",
" [0.25, 0.25, 0.25, 0.25]])"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pi_jc69 = msprime.JC69(state_independent=True)\n",
"pi_jc69.transition_matrix"
]
},
{
"cell_type": "markdown",
"id": "9a816795",
"metadata": {},
"source": [
"With this change, mutating twice with {class}`.JC69`\n",
"is *equivalent* to mutating once with rate equal to the sum of the rates.\n",
"In other words, these two operations are equivalent:"
]
},
{
"cell_type": "code",
"execution_count": 19,
"id": "70b9497e",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Mutated once:\n",
"\t A\tC\tG\tT\n",
"A\t 832\t863\t890\t860\n",
"C\t 850\t823\t802\t815\n",
"G\t 865\t833\t860\t877\n",
"T\t 853\t850\t846\t856\n",
"Mutated twice:\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\t A\tC\tG\tT\n",
"A\t 849\t838\t845\t847\n",
"C\t 840\t883\t835\t823\n",
"G\t 827\t885\t840\t835\n",
"T\t 868\t841\t852\t851\n"
]
}
],
"source": [
"alpha = 0.4 * rate\n",
"beta = 0.6 * rate\n",
"# mutate once:\n",
"mts1 = msprime.sim_mutations(ts, rate=alpha + beta, model=pi_jc69, random_seed=7)\n",
"print(\"Mutated once:\")\n",
"count_transitions(mts1, pi_jc69.alleles)\n",
"# or, mutate twice:\n",
"mts2 = msprime.sim_mutations(ts, rate=alpha, model=pi_jc69, random_seed=7)\n",
"mts2 = msprime.sim_mutations(mts2, rate=beta, model=pi_jc69, random_seed=8)\n",
"print(\"Mutated twice:\")\n",
"count_transitions(mts2, pi_jc69.alleles)"
]
},
{
"cell_type": "markdown",
"id": "2cfa1f80",
"metadata": {},
"source": [
"These are very similar - any differences are due to statistical noise.\n",
"\n",
"However, now 25% of the mutations are silent.\n",
"These silent mutations are harmless, and will not affect most things at all.\n",
"However, they do count as mutations, and so including them decreases the\n",
"effective mutation rate.\n",
"Above, we produced ``mts`` by mutating with the default Jukes-Cantor model\n",
"at rate 1.5. Here, we produced ``mts1`` (and ``mts2``) using Jukes-Cantor\n",
"but with 25% of the mutations silent.\n",
"This means that to produce an set of mutations equivalent to ``mts``\n",
"that includes silent mutations, we need to use the parent-independent\n",
"Jukes-Cantor model with rate {math}`1/(1-0.25) = 4/3` times higher:"
]
},
{
"cell_type": "code",
"execution_count": 20,
"id": "f96cfa8e",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\t A\tC\tG\tT\n",
"A\t 1134\t1124\t1104\t1226\n",
"C\t 1070\t1040\t1185\t1153\n",
"G\t 1175\t1106\t1072\t1118\n",
"T\t 1086\t1153\t1117\t1113\n"
]
}
],
"source": [
"mts_with_silent = msprime.sim_mutations(ts, rate=rate * 4/3, model=pi_jc69, random_seed=12)\n",
"count_transitions(mts_with_silent, pi_jc69.alleles)"
]
},
{
"cell_type": "markdown",
"id": "20dab210",
"metadata": {},
"source": [
"Note that the offdiagonal counts are similar to what we saw above for ``mts`` -\n",
"again, any differences are due to statistical noise.\n",
"Another way to see this is that nucleotide diversity is similar for ``mts`` and ``mts_with_silent``,\n",
"but is only 75% as big for ``mts1``:"
]
},
{
"cell_type": "code",
"execution_count": 21,
"id": "85606f89",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Nucleotide diversity for default JC69 with rate 0.0015: 0.0053\n",
" parent-independent JC69 with rate 0.0015: 0.0039\n",
" parent-independent JC69 with rate 0.002: 0.0053\n"
]
}
],
"source": [
"print(f\"Nucleotide diversity for default JC69 with rate {rate}: {mts.diversity():.4f}\")\n",
"print(f\" parent-independent JC69 with rate {rate}: {mts1.diversity():.4f}\")\n",
"print(f\" parent-independent JC69 with rate {rate * 4/3}: {mts_with_silent.diversity():.4f}\")"
]
},
{
"cell_type": "markdown",
"id": "93b3ab36",
"metadata": {},
"source": [
"Here's the takeaways from this section:\n",
"\n",
"```{attention}\n",
"1. If you're going to add mutations to the same tree sequence more than once,\n",
" you should use the ``parent_independent`` mutation models\n",
" (or carefully consider the consequences).\n",
"2. If you use a parent-independent model, you need to make your mutation rate higher,\n",
" to account for the silent mutations. See {ref}`sec_mutations_adjusting_for_silent`\n",
" for more details.\n",
"```\n",
"\n",
"\n",
"(sec_mutations_models)=\n",
"\n",
"## Models\n",
"\n",
"Mutation models are specified using the `model` parameter to\n",
"{func}`.sim_mutations`. This parameter can either take the form of a\n",
"string describing the model (e.g. `model=\"jc69\"`) or an instance of a\n",
"model definition class (e.g `model=msprime.JC69()`).\n",
"Here are the available models; they are documented in more detail below.\n",
"\n",
"- {class}`.BinaryMutationModel`: Basic binary mutation model with two flip-flopping alleles: \"0\" and \"1\".\n",
"- {class}`.JC69`: Jukes & Cantor model ('69), equal probability of transitions between nucleotides\n",
"- {class}`.HKY`: Hasegawa, Kishino & Yano model ('85), different probabilities for transitions and transversions\n",
"- {class}`.F84`: Felsenstein model ('84), different probabilities for transitions and transversions\n",
"- {class}`.GTR`: The Generalised Time-Reversible nucleotide mutation model, a general parameterisation of a time-reversible mutation process\n",
"- {class}`.BLOSUM62`: The BLOSUM62 model of time-reversible amino acid mutation\n",
"- {class}`.PAM`: The PAM model of time-reversible amino acid mutation\n",
"- {class}`.MatrixMutationModel`: Superclass of the specific mutation models with a finite set of states\n",
"- {class}`.InfiniteAlleles`: A generic infinite-alleles mutation model\n",
"- {class}`.SLiMMutationModel`: An infinite-alleles model of mutation producing SLiM-style mutations\n",
"\n",
"(sec_mutations_matrix_mutations_models)=\n",
"\n",
"### Matrix Mutation Models\n",
"\n",
"These classes are defined by an alphabet of possible alleles (`alleles`); an array of\n",
"probabilities that determines how likely each allele is to be the root, ancestral allele\n",
"(`root_distribution`); and a transition matrix specifying the probability for each allele\n",
"to mutate to every other allele. Each class has specific values of these parameters to\n",
"create the specific model. For your own custom model these parameters can be set using\n",
"{class}`msprime.MatrixMutationModel`. For more detail about how mutations are simulated\n",
"in these models see {ref}`sec_mutations_matrix_mutation_models_details`.\n",
"\n",
"(sec_mutations_matrix_mutation_models_details)=\n",
"\n",
"### Mutation Matrix Models Details\n",
"\n",
"Mutation matrix models are specified by three things: an alphabet,\n",
"a root distribution, and a transition matrix.\n",
"These leave one free parameter: an overall mutation rate,\n",
"specified by the mutation `rate` in the call to {func}`.sim_mutations`.\n",
"Concisely,\n",
"the underlying model of mutation is a continuous-time Markov chain on the alphabet,\n",
"started by a draw from `root_distribution`, and\n",
"with instantaneous transition rate from `i` to `j` that is equal to\n",
"`rate` multiplied by `transition_matrix[i,j]`.\n",
"The `root distribution` and every row in the `transition_matrix`\n",
"must give *probabilities*, i.e., they must be nonnegative numbers summing to 1.\n",
"For the precise interpretation of these parameters\n",
"(especially when the transition matrix has nonzero entries on the diagonal)\n",
"see {ref}`sec_mutations_matrix_mutation_theory`.\n",
"\n",
"You can define your own, but you probably don't need to:\n",
"there are several mutation matrix models already implemented in `msprime`,\n",
"using binary (0/1), nucleotide, or amino acid alphabets:\n",
"\n",
"### Defining your own finite-sites model\n",
"\n",
"If you want to define your own {class}`.MatrixMutationModel`, you have a good\n",
"deal of freedom. For instance, here's a \"decomposition/growth/disturbance\"\n",
"mutation model, where the only possible transitions are ðŸŽ„ to ðŸ”¥, ðŸ”¥ to ðŸ’©, and\n",
"ðŸ’© to ðŸŽ„, with the first transition happening at one-fifth the rate of the\n",
"other two:"
]
},
{
"cell_type": "code",
"execution_count": 22,
"id": "0bb21286",
"metadata": {},
"outputs": [],
"source": [
"alleles = [\"ðŸ’©\", \"ðŸŽ„\", \"ðŸ”¥\"]\n",
"model = msprime.MatrixMutationModel(\n",
" alleles,\n",
" root_distribution = [1.0, 0.0, 0.0],\n",
" transition_matrix = [[0.0, 1.0, 0.0],\n",
" [0.0, 0.8, 0.2],\n",
" [1.0, 0.0, 0.0]]\n",
")\n",
"ts = msprime.sim_ancestry(6, population_size=10, random_seed=2, sequence_length=7)\n",
"mts = msprime.sim_mutations(ts, rate=2, random_seed=1, model=model)\n"
]
},
{
"cell_type": "markdown",
"id": "48c8bdc5",
"metadata": {},
"source": [
"We have simulated from this model at rate 2, so the overall rate of mutation\n",
"from ðŸ’© to ðŸŽ„ and ðŸ”¥ to ðŸ’© is 2, and from ðŸŽ„ to ðŸ”¥ is {math}`2 \\times 0.2 = 0.4`.\n",
"As a result, roughly 5/7th of the states will be ðŸŽ„, with the remainder\n",
"divided evenly between ðŸ’© and ðŸ”¥. Here is the resulting \"genotype matrix\":"
]
},
{
"cell_type": "code",
"execution_count": 23,
"id": "9bb34769",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"ðŸ”¥ðŸŽ„ðŸŽ„ðŸŽ„ðŸ’©ðŸŽ„ðŸŽ„ðŸŽ„ðŸŽ„ðŸŽ„ðŸŽ„ðŸŽ„\n",
"ðŸ’©ðŸ’©ðŸ’©ðŸŽ„ðŸŽ„ðŸŽ„ðŸŽ„ðŸ’©ðŸŽ„ðŸŽ„ðŸŽ„ðŸŽ„\n",
"ðŸŽ„ðŸŽ„ðŸ”¥ðŸ”¥ðŸ’©ðŸŽ„ðŸŽ„ðŸŽ„ðŸŽ„ðŸŽ„ðŸŽ„ðŸŽ„\n",
"ðŸŽ„ðŸŽ„ðŸ’©ðŸ”¥ðŸŽ„ðŸŽ„ðŸŽ„ðŸŽ„ðŸ’©ðŸ’©ðŸ”¥ðŸ’©\n",
"ðŸ’©ðŸŽ„ðŸŽ„ðŸŽ„ðŸ”¥ðŸŽ„ðŸŽ„ðŸŽ„ðŸŽ„ðŸŽ„ðŸ”¥ðŸŽ„\n",
"ðŸŽ„ðŸŽ„ðŸŽ„ðŸŽ„ðŸŽ„ðŸŽ„ðŸŽ„ðŸŽ„ðŸŽ„ðŸŽ„ðŸŽ„ðŸŽ„\n",
"ðŸ’©ðŸŽ„ðŸŽ„ðŸŽ„ðŸŽ„ðŸ’©ðŸŽ„ðŸŽ„ðŸŽ„ðŸ’©ðŸŽ„ðŸ’©\n"
]
}
],
"source": [
"for v in mts.variants():\n",
" print(\"\".join(v.alleles[k] for k in v.genotypes))\n",
"\n"
]
},
{
"cell_type": "markdown",
"id": "76750ad4",
"metadata": {},
"source": [
"(sec_mutations_matrix_mutation_theory)=\n",
"\n",
"### Parameterisation of Matrix Mutation Models\n",
"\n",
"Mutation matrix models are specified by three things: an alphabet,\n",
"a root distribution, and a transition matrix.\n",
"These leave one free parameter: an overall mutation rate,\n",
"specified by the mutation `rate` in the call to {func}`.sim_mutations`.\n",
"Concisely,\n",
"the underlying model of mutation is a continuous-time Markov chain on the alphabet,\n",
"started by a draw from `root_distribution`, and\n",
"with instantaneous transition rate from `i` to `j` that is equal to\n",
"`rate` multiplied by `transition_matrix[i,j]`.\n",
"The `root distribution` and every row in the `transition_matrix`\n",
"must give *probabilities*, i.e., they must be nonnegative numbers summing to 1.\n",
"\n",
"To interpret these parameters,\n",
"it helps to know how the underlying mutational process is implemented.\n",
"First, mutations are placed on the tree,\n",
"with a mean density equal to the `rate`, per unit of time and sequence length.\n",
"If `discrete_genome=False` then this is an infinite-sites model,\n",
"so each mutation occurs at a distinct position along the genome.\n",
"If `discrete_genome=True` (the default setting) then at each integer position,\n",
"each branch of the tree at that position gets a Poisson number of mutations\n",
"with mean equal to `rate` multiplied by the length of the branch.\n",
"Next, each site that has a mutation is assigned an ancestral state,\n",
"i.e., the allele at the root of the tree at that position,\n",
"by drawing an allele from the probabilities in the `root_distribution`.\n",
"Now, each mutation is examined, moving down the tree,\n",
"and a derived state is chosen using the probabilities given in the\n",
"row of the `transition_matrix` that corresponds to the \"parental state\",\n",
"i.e., the allele that this mutation will replace.\n",
"Importantly, the mutation is recorded even if the chosen allele is the same\n",
"as the parental allele - so, if the diagonal of the transition matrix\n",
"contains nonzero entries, then the mutated tree sequence will\n",
"(probably) contain silent mutations.\n",
"\n",
"Some facts about Markov chains are useful to interpret the statistics\n",
"of these mutation models.\n",
"First, suppose we have tabulated all mutations, and so for each pair of alleles\n",
"{math}`i` and {math}`j` we have the proportion of mutations that caused an {math}`i \\to j` change.\n",
"If allele {math}`i` mutates to a different allele, the chance it mutates to allele {math}`j`\n",
"is proportional to `transition_matrix[i,j]` but excluding the diagonal (no-change) entry,\n",
"so is equal to `transition_matrix[i,j] / (1 - transition_matrix[i,i])`.\n",
"Second, suppose that an ancestor carries allele {math}`i` at a given position.\n",
"The probability that her descendant some time {math}`t` in the future carries allele {math}`j` at that position\n",
"is given by a matrix exponential of\n",
"the scaled [infinitesimal rate matrix]() of the Markov chain,\n",
"which can be computed as follows:\n",
"\n",
"```{code-block} python\n",
"import scipy\n",
"\n",
"Q = (transition_matrix - np.eye(len(alleles)))\n",
"Pt = scipy.linalg.expm(t * rate * Q)[i,j]\n",
"\n",
"```\n",
"\n",
"If the top of a branch of length {math}`t` has allele {math}`i`,\n",
"the bottom of the branch has allele {math}`j` with probability `Pt[i,j]`.\n",
"\n",
"\n",
"(sec_mutations_adjusting_for_silent)=\n",
"\n",
"#### Adjusting mutation rates for silent mutations\n",
"\n",
"As discussed in {ref}`sec_mutations_silent`, matrix mutation models can insert *silent* mutations,\n",
"which may make it difficult to set the mutation rate appropriately,\n",
"since by \"mutation rate\" we mean the rate of appearance of mutations that *change* the allelic state.\n",
"However, the \"mutation rate\" is not in fact a single value: it depends on context,\n",
"as some alleles may tend to mutate more than others.\n",
"\n",
"As an example of how to appropriately adjust the mutation rate to produce a given expected diversity,\n",
"take the following somewhat contrived problem.\n",
"Suppose we generate a tree sequence of 1000 samples of 10Mb from a population of size 1,000,\n",
"which has mutations generated under the HKY model with {math}`\\kappa = 0.75`\n",
"and equilibrium frequencies {math}`\\pi = [0.1, 0.2, 0.3, 0.4]`.\n",
"We'd like to choose the mutation rate {math}`\\mu` so that the mean genetic diversity\n",
"is equal to 0.001.\n",
"If this was equal to the usual {math}`\\theta = 4Nu`, we'd need a \"mutation rate\"\n",
"of {math}`u = 2.5 \\times 10^{-7}`, so we know we'll need a number slightly higher than that.\n",
"\n",
"To do this, let's figure out the mean rate of appearance of silent mutations.\n",
"We know that (a) the rate of appearance of silent mutations per nucleotide is given by {math}`\\mu`\n",
"times the diagonal of the transition matrix, and (b) the equilibrium frequency of the four\n",
"nucleotides is given by {math}`\\pi`. Multiplying these together and adding them up\n",
"gives us the expected proportion of mutations that are silent.\n",
"Now, writing that we want {math}`u` to equal {math}`\\mu` times one minus the this proportion,\n",
"we get {math}`2.5 \\times 10^{-7} = \\mu (1 - \\sum_i \\pi_i P_{ii})`, so\n",
"solving for {math}`\\mu` gives us:"
]
},
{
"cell_type": "code",
"execution_count": 24,
"id": "77c16e37",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Mutation rate should be increased by a factor of 1.28.\n"
]
}
],
"source": [
"pi = np.array([0.1, 0.2, 0.3, 0.4])\n",
"hky = msprime.HKY(kappa=0.75, equilibrium_frequencies=pi)\n",
"P = hky.transition_matrix\n",
"u = 2.5e-7\n",
"mu = u / (1 - np.sum(pi * np.diag(P)))\n",
"print(f\"Mutation rate should be increased by a factor of {mu/u:.2f}.\")"
]
},
{
"cell_type": "markdown",
"id": "2697de63",
"metadata": {},
"source": [
"Now, let's verify we get the nucleotide diversity we were aiming for:"
]
},
{
"cell_type": "code",
"execution_count": 25,
"id": "a69167a7",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Genetic diversity: 0.0010836445272138172.\n"
]
}
],
"source": [
"ts = msprime.sim_ancestry(1000, population_size=1000, sequence_length=1e7, recombination_rate=1e-8, random_seed=5)\n",
"mts = msprime.sim_mutations(ts, rate=mu, model=hky, random_seed=27)\n",
"theta = mts.diversity()\n",
"print(f\"Genetic diversity: {theta}.\")"
]
},
{
"cell_type": "markdown",
"id": "c78bdd07",
"metadata": {},
"source": [
"That's pretty close to 0.001! The difference is within statistical error.\n",
"\n",
"\n",
"(sec_mutations_mutation_infinite_alleles)=\n",
"\n",
"### Infinite Alleles Mutation Models\n",
"\n",
"You can also use a model of *infinite alleles* mutation: where each new mutation produces a unique,\n",
"never-before-seen allele. The underlying mutation model just assigns the derived state\n",
"to be a new integer every time a new mutation appears.\n",
"By default these integers start at zero, but a different starting point can be chosen,\n",
"with the `start_allele` parameter.\n",
"It does this globally across all mutations, so that the first assigned allele will be `start_allele`,\n",
"and if `n` alleles are assigned in total (across ancestral and derived states),\n",
"these will be the next `n-1` integers.\n",
"Many theoretical results are derived based on this mutation model (e.g., Ewens' sampling formula).\n",
"\n",
"For instance, here we'll simulate with the infinite alleles model on a single tree,\n",
"and print the resulting tree, labeling each mutation with its derived state:"
]
},
{
"cell_type": "code",
"execution_count": 26,
"id": "278f0377",
"metadata": {},
"outputs": [
{
"data": {
"image/svg+xml": [
""
],
"text/plain": [
""
]
},
"execution_count": 26,
"metadata": {
"filenames": {
"image/svg+xml": "/home/runner/work/tskit-site/tskit-site/docs/_build/jupyter_execute/mutations_50_0.svg"
}
},
"output_type": "execute_result"
}
],
"source": [
"ts = msprime.sim_ancestry(6, random_seed=2, sequence_length=1)\n",
"model = msprime.InfiniteAlleles()\n",
"mts = msprime.sim_mutations(ts, rate=2, random_seed=1, model=model)\n",
"t = mts.first()\n",
"ml = {m.id: m.derived_state for m in mts.mutations()}\n",
"SVG(t.draw_svg(mutation_labels=ml, node_labels={}, size=(400, 300)))\n"
]
},
{
"cell_type": "markdown",
"id": "0aa0e7ce",
"metadata": {},
"source": [
"Apparently, there were 20 mutations at this site, but the alleles present in the population are\n",
"\"13\" (in five copies), \"17\" (in two copies), and one copy each of \"14\", \"15\", \"19\", and \"20\".\n",
"Note that all other mutations on the tree are not observed in the population as they have\n",
"been are \"overwritten\" by subsequent mutations.\n",
"\n",
"\n",
":::{warning}\n",
"\n",
"Neither this nor the next infinite alleles mutation model check to see if the alleles\n",
"they produce already exist at the mutated sites. So, if you are using these\n",
"models to add mutations to an already-mutated tree sequence, it is up to you\n",
"to set the starting allele appropriately, and to make sure the results make sense!\n",
"\n",
":::\n",
"\n",
"(sec_mutations_mutation_slim_mutations)=\n",
"\n",
"### SLiM mutations\n",
"\n",
"A special class of infinite alleles model is provided for use with [SLiM](),\n",
"to agree with the underlying mutation model in SLiM.\n",
"As with the InfiniteAlleles model, it assigns each new mutation a unique integer,\n",
"by keeping track of the `next_id` and incrementing it each time a new mutation appears.\n",
"\n",
"This differs from the {class}`.InfiniteAlleles` because mutations\n",
"in SLiM can \"stack\": new mutations can add to the existing state, rather than\n",
"replacing the previous state. So, derived states are comma-separated lists of\n",
"mutation IDs, and the ancestral state is always the empty string. For instance,\n",
"if a new mutation with ID 5 occurs at a site, and then later another mutation\n",
"appears with ID 64, the sequence of alleles moving along this line of descent\n",
"would be `\"\"`, then `\"5\"`, and finally `\"5,64\"`. Furthermore, the mutation\n",
"model adds SLiM metadata to each mutation, which records, among other things,\n",
"the SLiM mutation type of each mutation, and the selection coefficient (which\n",
"is always 0.0, since adding mutations in this way only makes sense if they are\n",
"neutral). For this reason, the model has one required parameter: the `type`\n",
"of the mutation, a nonnegative integer. If, for instance, you specify\n",
"`type=1`, then the mutations in SLiM will be of type `m1`. For more\n",
"information, and for how to modify the metadata (e.g., changing the selection\n",
"coefficients), see\n",
"[the pyslim documentation]().\n",
"For instance,"
]
},
{
"cell_type": "code",
"execution_count": 27,
"id": "c2548e47",
"metadata": {},
"outputs": [
{
"data": {
"image/svg+xml": [
""
],
"text/plain": [
""
]
},
"execution_count": 27,
"metadata": {
"filenames": {
"image/svg+xml": "/home/runner/work/tskit-site/tskit-site/docs/_build/jupyter_execute/mutations_52_0.svg"
}
},
"output_type": "execute_result"
}
],
"source": [
"model = msprime.SLiMMutationModel(type=1)\n",
"mts = msprime.sim_mutations(\n",
" ts, rate=1, random_seed=1, model=model)\n",
"t = mts.first()\n",
"ml = {m.id: m.derived_state for m in mts.mutations()}\n",
"SVG(t.draw_svg(mutation_labels=ml, node_labels={}, size=(400, 300)))\n"
]
},
{
"cell_type": "markdown",
"id": "16052603",
"metadata": {},
"source": [
"These resulting alleles show how derived states are built.\n",
"\n",
"The behaviour of this mutation model when used to add mutations to a previously mutated\n",
"tree sequence can be subtle. Let's look at a simple example.\n",
"Here, we first lay down mutations of type 1, starting from ID 0:"
]
},
{
"cell_type": "code",
"execution_count": 28,
"id": "fc47f1ea",
"metadata": {},
"outputs": [
{
"data": {
"image/svg+xml": [
""
],
"text/plain": [
""
]
},
"execution_count": 28,
"metadata": {
"filenames": {
"image/svg+xml": "/home/runner/work/tskit-site/tskit-site/docs/_build/jupyter_execute/mutations_54_0.svg"
}
},
"output_type": "execute_result"
}
],
"source": [
"model_1 = msprime.SLiMMutationModel(type=1)\n",
"mts_1 = msprime.sim_mutations(ts, rate=0.5, random_seed=2, model=model_1)\n",
"t = mts_1.first()\n",
"ml = {m.id: m.derived_state for m in mts_1.mutations()}\n",
"SVG(t.draw_svg(mutation_labels=ml, node_labels={}, size=(400, 300)))\n"
]
},
{
"cell_type": "markdown",
"id": "9aa17ebe",
"metadata": {},
"source": [
"Next, we lay down mutations of type 2.\n",
"These we assign starting from ID 100,\n",
"to make it easy to see which are which:\n",
"in general just need to make sure that we start at an ID greater than any\n",
"previously assigned."
]
},
{
"cell_type": "code",
"execution_count": 29,
"id": "c073223c",
"metadata": {},
"outputs": [
{
"data": {
"image/svg+xml": [
""
],
"text/plain": [
""
]
},
"execution_count": 29,
"metadata": {
"filenames": {
"image/svg+xml": "/home/runner/work/tskit-site/tskit-site/docs/_build/jupyter_execute/mutations_56_0.svg"
}
},
"output_type": "execute_result"
}
],
"source": [
"model_2 = msprime.SLiMMutationModel(type=2, next_id=100)\n",
"mts = msprime.sim_mutations(\n",
" mts_1, rate=0.5, random_seed=3, model=model_2, keep=True)\n",
"t = mts.first()\n",
"ml = {m.id: m.derived_state for m in mts.mutations()}\n",
"SVG(t.draw_svg(mutation_labels=ml, node_labels={}, size=(400, 300)))\n"
]
},
{
"cell_type": "markdown",
"id": "f1ff42a5",
"metadata": {},
"source": [
"Note what has happened here: on the top branch on the right side of the tree,\n",
"with the first model we added two mutations: first a mutation with ID `0`,\n",
"then a mutation with ID `3`.\n",
"Then, with the second model, we added two more mutations to this same branch,\n",
"with IDs `100` and `102`, between these two mutations.\n",
"These were added to mutation `0`, obtaining alleles `0,100` and `0,100,102`.\n",
"But then, moving down the branch, we come upon the mutation with ID `3`.\n",
"This was already present in the tree sequence, so its derived state is not modified:\n",
"`0,3`. We can rationalise this, post-hoc, by saying that the type 1 mutation `3`\n",
"has \"erased\" the type 2 mutations `100` and `102`.\n",
"If you want a different arrangement,\n",
"you can go back and edit the derived states (and metadata) as you like."
]
}
],
"metadata": {
"jupytext": {
"text_representation": {
"extension": ".md",
"format_name": "myst",
"format_version": 0.12,
"jupytext_version": "1.9.1"
}
},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.12"
},
"source_map": [
14,
24,
46,
50,
53,
56,
112,
116,
123,
127,
135,
140,
167,
171,
176,
179,
197,
201,
206,
208,
215,
221,
236,
257,
278,
280,
286,
300,
320,
326,
376,
379,
383,
388,
397,
400,
406,
418,
433,
436,
443,
447,
525,
538,
545,
551,
644,
651,
655,
660,
682,
691,
735,
744,
752,
760,
768,
777
]
},
"nbformat": 4,
"nbformat_minor": 5
}