Computing likelihoods

In general, the likelihood function is defined as the probability (or probability density) assigned to an observed outcome by a particular model. It is often alternately referred to as the sampling probability, and is a cornerstone of many statistical inference procedures.

msprime provides functions for evaluating two sampling probabilities: that of a stored tree sequence for a given diploid population size \(N_e\) and per-link, per-generation recombination probability \(r\) under the standard ancestral recombination graph (ARG); and that of a pattern of mutations given a tree sequence and per-site, per-generation mutation probability \(\mu\) under the infinite sites model.

Note

Generic analysis tools for arbitrary tree sequences, allowing efficient calculation of genetic diversity, allele frequency spectra, F statistics, and so on, are provided within tskit and described in the statistics documentation. The likelihood functions below are provided by msprime because they are specific to the ARG and infinite sites models.

Warning

Evaluating these log likelihoods requires knowledge of a whole ARG, rather than just the more compact tree sequence representation. Additionally the log likelihood implementations are only available for continuous genomes. Hence, the the underlying tree sequence has to conform to the record_full_arg = True and discrete_genome = False options of the sim_ancestry() function.


Quick reference

log_arg_likelihood()

Log likelihood of an ARG topology and branch lengths

log_mutation_likelihood()

Log likelihood of a pattern of mutations arising from a given ARG


ARG sampling probability

The following example simulates an ARG with 5 diploid samples and evaluates the likelihood of the realisation for four combinations of parameters. Note that one combination coincides with the parameters used to simulate the ARG, while the other combinations are quite different.

    import msprime

    ts = msprime.sim_ancestry(
        5, recombination_rate=1, record_full_arg=True,
        sequence_length=1, discrete_genome=False, random_seed=42)
    print(msprime.log_arg_likelihood(ts, recombination_rate=0, Ne=1))
    print(msprime.log_arg_likelihood(ts, recombination_rate=0.1, Ne=1))
    print(msprime.log_arg_likelihood(ts, recombination_rate=1, Ne=1))
    print(msprime.log_arg_likelihood(ts, recombination_rate=10, Ne=10))
-1.7976931348623157e+308
-50.65406911332121
-34.34633889417085
-109.91004433263002

In this example, the simulated ARG contains at least one recombination, which is an event of probablity 0 when the recombination_rate = 0. Hence, the log likelihood for recombination rate zero returns a numerical representation of negative infinity, i.e. the logarithm of zero. The other three combinations of parameters all result in positive likelihoods, or finite log likelihoods.

The data was generated from a recombination rate of one and a default population size of one, and these parameters give rise to a relatively high log likelihood. The same ARG would have been an unlikely realisation under very different parameters, which thus result in more negative values of the log likelihood.

Mutation sampling probability

The next example adds random mutations to the tree sequence generated above in ARG sampling probability and evaluates the unnormalised log probability of the mutation realisation, given the tree sequence and a prescribed mutation rate.

    ts = msprime.mutate(ts, rate=1, random_seed=42)
    print(msprime.log_mutation_likelihood(ts, mutation_rate=0))
    print(msprime.log_mutation_likelihood(ts, mutation_rate=0.1))
    print(msprime.log_mutation_likelihood(ts, mutation_rate=1))
    print(msprime.log_mutation_likelihood(ts, mutation_rate=10))
-inf
-21.388260791541896
-9.646968662129233
-58.02017406357392

Since there is at least one mutation in the realisation, its probability given mutation_rate = 0 is 0, resulting in a log likelihood of negative infinity. The mutation realisation is a typical outcome when mutation_rate = 1, which is the value used to simulate it, and which thus results in a relatively high log likelihood. Mutation rates which are significantly higher or lower result in more negative log likelihoods because generating the same realisation using those rates is unlikely.