{
"cells": [
{
"cell_type": "markdown",
"id": "857b143a",
"metadata": {},
"source": [
"(sec_likelihood)=\n",
"\n",
"# Computing likelihoods\n",
"\n",
"In general, the likelihood function is defined as the probability (or\n",
"probability density) assigned to an observed outcome by a particular model.\n",
"It is often alternately referred to as the sampling probability, and is a\n",
"cornerstone of many statistical inference procedures.\n",
"\n",
"`msprime` provides functions for evaluating two sampling probabilities:\n",
"that of a stored tree sequence for a given diploid population size\n",
"{math}`N_e` and per-link, per-generation recombination probability {math}`r`\n",
"under the standard ancestral recombination graph (ARG); and that of a pattern\n",
"of mutations given a tree sequence and per-site, per-generation mutation\n",
"probability {math}`\\mu` under the infinite sites model.\n",
"\n",
":::{note}\n",
"Generic analysis tools for arbitrary tree sequences, allowing efficient calculation\n",
"of genetic diversity, allele frequency spectra, F statistics, and so on, are\n",
"provided within [tskit](https://tskit.dev/tskit/docs/stable/) and described in the\n",
"[statistics documentation](https://tskit.dev/tskit/docs/stable/stats.html).\n",
"The likelihood functions below are provided by `msprime` because they are specific\n",
"to the ARG and infinite sites models. \n",
":::\n",
"\n",
":::{warning}\n",
"Evaluating these log likelihoods requires knowledge of a whole ARG,\n",
"rather than just the more compact tree sequence representation. Additionally\n",
"the log likelihood implementations are only available for continuous genomes.\n",
"Hence, the underlying tree sequence has to conform to the\n",
"`record_full_arg = True` and `discrete_genome = False` options of the\n",
"{func}`.sim_ancestry` function.\n",
":::\n",
"\n",
"---\n",
"\n",
"## Quick reference\n",
"\n",
"{func}`.log_arg_likelihood` \n",
": Log likelihood of an ARG topology and branch lengths \n",
"\n",
"{func}`.log_mutation_likelihood` \n",
": Log likelihood of a pattern of mutations arising from a given ARG \n",
"\n",
"---\n",
"\n",
"(sec_likelihood_topology)=\n",
"\n",
"## ARG sampling probability\n",
"\n",
"The following example simulates an ARG with 5 diploid samples and evaluates\n",
"the likelihood of the realisation for four combinations of parameters.\n",
"Note that one combination coincides with the parameters used to simulate\n",
"the ARG, while the other combinations are quite different."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "9268a3a4",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"-1.7976931348623157e+308\n",
"-50.65406911332121\n",
"-34.34633889417085\n",
"-109.91004433263002\n"
]
}
],
"source": [
" import msprime\n",
"\n",
" ts = msprime.sim_ancestry(\n",
" 5, recombination_rate=1, record_full_arg=True,\n",
" sequence_length=1, discrete_genome=False, random_seed=42)\n",
" print(msprime.log_arg_likelihood(ts, recombination_rate=0, Ne=1))\n",
" print(msprime.log_arg_likelihood(ts, recombination_rate=0.1, Ne=1))\n",
" print(msprime.log_arg_likelihood(ts, recombination_rate=1, Ne=1))\n",
" print(msprime.log_arg_likelihood(ts, recombination_rate=10, Ne=10))"
]
},
{
"cell_type": "markdown",
"id": "ebab7982",
"metadata": {},
"source": [
"In this example, the simulated ARG contains at least one recombination,\n",
"which is an event of probability 0 when the `recombination_rate = 0`.\n",
"Hence, the log likelihood for recombination rate zero returns a numerical\n",
"representation of negative infinity, i.e. the logarithm of zero. The other\n",
"three combinations of parameters all result in positive likelihoods,\n",
"or finite log likelihoods.\n",
"\n",
"The data was generated from a recombination rate of one and a default\n",
"population size of one, and these parameters give rise to a relatively high\n",
"log likelihood. The same ARG would have been an unlikely realisation under\n",
"very different parameters, which thus result in more negative values of the\n",
"log likelihood.\n",
"\n",
"## Mutation sampling probability\n",
"\n",
"The next example adds random mutations to the tree sequence\n",
"generated above in {ref}`sec_likelihood_topology` and evaluates the\n",
"unnormalised log probability of the mutation realisation, given the\n",
"tree sequence and a prescribed mutation rate."
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "3fe68d22",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"-inf\n",
"-21.388260791541896\n",
"-9.646968662129233\n",
"-58.02017406357392\n"
]
}
],
"source": [
" ts = msprime.mutate(ts, rate=1, random_seed=42)\n",
" print(msprime.log_mutation_likelihood(ts, mutation_rate=0))\n",
" print(msprime.log_mutation_likelihood(ts, mutation_rate=0.1))\n",
" print(msprime.log_mutation_likelihood(ts, mutation_rate=1))\n",
" print(msprime.log_mutation_likelihood(ts, mutation_rate=10))"
]
},
{
"cell_type": "markdown",
"id": "5b95aae6",
"metadata": {},
"source": [
"Since there is at least one mutation in the realisation, its probability\n",
"given `mutation_rate = 0` is 0, resulting in a log likelihood of negative\n",
"infinity. The mutation realisation is a typical outcome when\n",
"`mutation_rate = 1`, which is the value used to simulate it, and which thus\n",
"results in a relatively high log likelihood. Mutation rates which are\n",
"significantly higher or lower result in more negative log likelihoods\n",
"because generating the same realisation using those rates is unlikely."
]
}
],
"metadata": {
"jupytext": {
"text_representation": {
"extension": ".md",
"format_name": "myst",
"format_version": 0.12,
"jupytext_version": "1.9.1"
}
},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.13"
},
"source_map": [
12,
70,
80,
102,
108
]
},
"nbformat": 4,
"nbformat_minor": 5
}