Population sizes and priors

Population sizes and priors#

Note

Functionality described on this page is preliminary and subject to change in the future. For this reason, classes and methods may not form part of the publicly available API and may not be fully documented yet.

The rate of coalescence events over time is determined by demographics, population structure, and selection. The rescaling step of tsdate attempts to distribute coalescences so that the mutation rate over time is reasonably constant. Assuming this is a good approximation, the inverse coalescence rate in a dated tree sequence can be used to infer historical processes.

To illustrate this, we will generate data from a population that was large in recent history, but had a small bottleneck of only 100 individuals 10 thousand generations ago, lasting for (say) 80 generations, with a medium-sized population before that:

import msprime
import demesdraw
from matplotlib import pyplot as plt

bottleneck_time = 10000
demography = msprime.Demography()
demography.add_population(name="Population", initial_size=5e4)
demography.add_population_parameters_change(time=bottleneck_time, initial_size=100)
demography.add_population_parameters_change(time=bottleneck_time + 80, initial_size=2e3)

mutation_rate = 1e-8
# Simulate a short tree sequence with a population size history.
ts = msprime.sim_ancestry(
    10, sequence_length=2e6, recombination_rate=2e-8, demography=demography, random_seed=321)
ts = msprime.sim_mutations(ts, rate=mutation_rate, random_seed=321)
fig, ax = plt.subplots(1, 1, figsize=(4, 6))
demesdraw.tubes(demography.to_demes(), ax, scale_bar=True)
ax.annotate(
  "bottleneck",
  xy=(0, bottleneck_time),
  xytext=(1e4, bottleneck_time * 1.04),
  arrowprops=dict(arrowstyle="->"))
Text(10000.0, 10400.0, 'bottleneck')
_images/fc6e3de3daf3107ebed6042e20a244f8b8649c42e1403a4e00b2b684cbb37e75.png

To test how well tsdate does in this situation, we can redate the known (true) tree sequence topology, which replaces the true node and mutation times with estimates from the dating algorithm.

import tsdate
redated_ts = tsdate.date(ts, mutation_rate=mutation_rate)

If we plot true node time against tsdate-estimated node time for each node in the tree sequence, we can see that the tsdate estimates are pretty much in line with the truth, although there is a a clear band which is difficult to date at 10,000 generations, corresponding to the instantaneous change in population size at that time.

Hide code cell source
import numpy as np

fig, axes = plt.subplots(1, 2, figsize=(10, 5))
def plot_real_vs_tsdate_times(ax, x, y, ts_x=None, ts_y=None, plot_zeros=False, title=None, **kwargs):
    x, y = np.array(x), np.array(y)
    if plot_zeros is False:
        x, y = x[(x * y) > 0], y[(x * y) > 0]
    ax.scatter(x, y, **kwargs)
    ax.set_xscale('log')
    ax.set_xlabel(f'Real time' + ('' if ts_x is None else f' ({ts_x.time_units})'))
    ax.set_yscale('log')
    ax.set_ylabel(f'Estimated time from tsdate'+ ('' if ts_y is None else f' ({ts_y.time_units})'))
    line_pts = np.logspace(np.log10(x.min()), np.log10(x.max()), 10)
    ax.plot(line_pts, line_pts, linestyle=":")
    if title:
      ax.set_title(title)

unconstr_times = [nd.metadata.get("mn", nd.time) for nd in redated_ts.nodes()]
plot_real_vs_tsdate_times(
  axes[0], ts.nodes_time, unconstr_times, ts, redated_ts, alpha=0.1, title="With rescaling")
redated_unscaled_ts = tsdate.date(ts, mutation_rate=mutation_rate, rescaling_intervals=0)
unconstr_times2 = [nd.metadata.get("mn", nd.time) for nd in redated_unscaled_ts.nodes()]
plot_real_vs_tsdate_times(
  axes[1], ts.nodes_time, unconstr_times2, ts, redated_ts, alpha=0.1, title="Without rescaling")
_images/9ab45edbebf6448cfd2b46c64fe9fdf2469e005119a011e81a81d42f8f1fbf06.png

The plot on the right is from running tsdate without the rescaling step. It is clear that for populations with variable sizes over time, rescaling can help considerably in obtaining correct date estimations.

Misspecified priors#

The rescaling method for the default variational_gamma method (inspired by the rescaling algorithm described by Deng et al 2024), coupled with the absence of a strong informative prior on internal nodes, means that we expect this method to be robust to deviations from neutrality and panmixia. However, alternative approaches such as the inside_outside method default to a coalescent prior that assumes a fixed population size. Hence these approaches currently perform very poorly on such data:

import tsdate
est_pop_size = ts.diversity() / (4 * mutation_rate)  # calculate av Ne from data
redated_ts = tsdate.inside_outside(ts, mutation_rate=mutation_rate, population_size=est_pop_size)
unconstr_times = [nd.metadata.get("mn", nd.time) for nd in redated_ts.nodes()]
fig, ax = plt.subplots(1, 1, figsize=(5, 5))
title = "inside_outside method; prior based on fixed Ne"
plot_real_vs_tsdate_times(ax, ts.nodes_time, unconstr_times, ts, redated_ts, alpha=0.1, title=title)
_images/0b51e0e426ff785df354e9cd4c1d7b85e28d967e507521e2f49a3cd2661aadf9.png

If you cannot use the variational_gamma method, the discrete time methods also allow population_size to be either a single number, specifying the “effective population size”, or a piecewise constant function of time, specifying a set of fixed population sizes over a number of contiguous time intervals. Functions of this sort are captured by the PopulationSizeHistory class: see the Population sizes and priors page for its use and interpretation.

Estimating Ne from data#

In the example above, in the absence of an expected effective population size for use in the inside_outside method, we used a value approximated from the data. The standard way to do so is to use the (sitewise) genetic diversity divided by four-times the mutation rate:

print("A rough estimate of the effective population size is", ts.diversity() / (4 * 1e-6))
A rough estimate of the effective population size is 63.65526315789836