# Computing likelihoods

## Contents

# Computing likelihoods#

In general, the likelihood function is defined as the probability (or probability density) assigned to an observed outcome by a particular model. It is often alternately referred to as the sampling probability, and is a cornerstone of many statistical inference procedures.

`msprime`

provides functions for evaluating two sampling probabilities:
that of a stored tree sequence for a given diploid population size
\(N_e\) and per-link, per-generation recombination probability \(r\)
under the standard ancestral recombination graph (ARG); and that of a pattern
of mutations given a tree sequence and per-site, per-generation mutation
probability \(\mu\) under the infinite sites model.

Note

Generic analysis tools for arbitrary tree sequences, allowing efficient calculation
of genetic diversity, allele frequency spectra, F statistics, and so on, are
provided within tskit and described in the
statistics documentation.
The likelihood functions below are provided by `msprime`

because they are specific
to the ARG and infinite sites models.

Warning

Evaluating these log likelihoods requires knowledge of a whole ARG,
rather than just the more compact tree sequence representation. Additionally
the log likelihood implementations are only available for continuous genomes.
Hence, the underlying tree sequence has to conform to the
`record_full_arg = True`

and `discrete_genome = False`

options of the
`sim_ancestry()`

function.

## Quick reference#

`log_arg_likelihood()`

Log likelihood of an ARG topology and branch lengths

`log_mutation_likelihood()`

Log likelihood of a pattern of mutations arising from a given ARG

## ARG sampling probability#

The following example simulates an ARG with 5 diploid samples and evaluates the likelihood of the realisation for four combinations of parameters. Note that one combination coincides with the parameters used to simulate the ARG, while the other combinations are quite different.

```
import msprime
ts = msprime.sim_ancestry(
5, recombination_rate=1, record_full_arg=True,
sequence_length=1, discrete_genome=False, random_seed=42)
print(msprime.log_arg_likelihood(ts, recombination_rate=0, Ne=1))
print(msprime.log_arg_likelihood(ts, recombination_rate=0.1, Ne=1))
print(msprime.log_arg_likelihood(ts, recombination_rate=1, Ne=1))
print(msprime.log_arg_likelihood(ts, recombination_rate=10, Ne=10))
```

```
-1.7976931348623157e+308
-50.65406911332121
-34.34633889417085
-109.91004433263002
```

In this example, the simulated ARG contains at least one recombination,
which is an event of probability 0 when the `recombination_rate = 0`

.
Hence, the log likelihood for recombination rate zero returns a numerical
representation of negative infinity, i.e. the logarithm of zero. The other
three combinations of parameters all result in positive likelihoods,
or finite log likelihoods.

The data was generated from a recombination rate of one and a default population size of one, and these parameters give rise to a relatively high log likelihood. The same ARG would have been an unlikely realisation under very different parameters, which thus result in more negative values of the log likelihood.

## Mutation sampling probability#

The next example adds random mutations to the tree sequence generated above in ARG sampling probability and evaluates the unnormalised log probability of the mutation realisation, given the tree sequence and a prescribed mutation rate.

```
ts = msprime.mutate(ts, rate=1, random_seed=42)
print(msprime.log_mutation_likelihood(ts, mutation_rate=0))
print(msprime.log_mutation_likelihood(ts, mutation_rate=0.1))
print(msprime.log_mutation_likelihood(ts, mutation_rate=1))
print(msprime.log_mutation_likelihood(ts, mutation_rate=10))
```

```
-inf
-21.388260791541896
-9.646968662129233
-58.02017406357392
```

Since there is at least one mutation in the realisation, its probability
given `mutation_rate = 0`

is 0, resulting in a log likelihood of negative
infinity. The mutation realisation is a typical outcome when
`mutation_rate = 1`

, which is the value used to simulate it, and which thus
results in a relatively high log likelihood. Mutation rates which are
significantly higher or lower result in more negative log likelihoods
because generating the same realisation using those rates is unlikely.