Specifying a Prior#

Basic usage#

The build_prior_grid() and build_parameter_grid() functions allow you to create a bespoke prior for the Discrete-time and Continuous-time, respectively. This can be passed in to date() using the priors argument. It provides a tuneable alternative to passing the population size directly to the date() function.

Along with adjusting the method, this is the recommended way to carry out more sophisticated analyses, tweaking the tsdate algorithm to alter its runtime and accuracy.

import tskit
import tsdate
ts = tskit.load("data/basic_example.trees")
mu = 1e-8  # mutation rate
N = 100  # effective population size

prior1 = tsdate.build_prior_grid(ts, population_size=N)
prior2 = tsdate.build_prior_grid(ts, population_size=N, timepoints=40)
prior3 = tsdate.build_prior_grid(ts, population_size=N, prior_distribution="gamma")

ts1 = tsdate.date(ts, mu, priors=prior1)  # Identical to tsdate.date(ts, mu, population_size=N)
ts2 = tsdate.date(ts, mu, priors=prior2)  # Uses a denser timegrid
ts3 = tsdate.date(ts, mu, priors=prior3)  # Approximates the discrete-time priors with a gamma

See below for more explanation of the interpretation of the parameters passed to build_prior_grid().

Prior shape#

For Discrete-time methods, it is possible to switch from the (default) lognormal approximation to a gamma distribution, used when building a mixture prior for nodes that have variable numbers of children in different genomic regions. The discretised prior is then constructed by evaluating the lognormal (or gamma) distribution across a grid of fixed times. Tests have shown that the lognormal is usually a better fit to the true prior in most cases.

For Continuous-time methods, only the gamma distribution is available.

Setting the timegrid#

For Discrete-time methods, a grid of timepoints is created. An explicit timegrid can be passed via the timepoints parameter. Alternatively, if a single integer is passed, a nonuniform time grid will be chosen based on quantiles of the lognormal (or gamma) approximation of the mixture prior.

Note that if an integer is given this is not the number of timepoints, but the number of quantiles used as a basis for generating timepoints. The actual number of timepoints will be larger than this number. For instance

timepoints = 10
prior = tsdate.build_prior_grid(ts, population_size=N, timepoints=timepoints)
dated_ts = tsdate.date(ts, mu, priors=prior)

print(
    f"`timepoints`={timepoints}, producing a total of {len(prior.timepoints)}",
    "timepoints in the timegrid, at these times"
)
print(prior.timepoints)
`timepoints`=10, producing a total of 27 timepoints in the timegrid, at these times
[  0.           1.31114259   2.08111371   2.8918665    3.81479025
   4.91775488   6.29720612   8.11790797  10.70869495  14.88054978
  23.61918259  32.27581476  44.76860467  59.9292899   83.56615732
  99.98261661 119.46746649 141.08161503 169.06167087 190.74961256
 227.93254483 289.7263965  351.13933362 397.4112448  454.85521937
 533.97023983 668.83044558]

For Continuous-time methods, a grid of variational parameters is created (e.g. shape and rate parameters of gamma distributions for each node), which may be modified manually. Currently, node-specific priors are combined to generate a global i.i.d. prior (although this behaviour will be changed in future releases to provide more flexibility.)

The conditional coalescent#

Currently, priors are based on the conditional coalescent. Specifically, in a tree sequence of s samples, the distribution of times for a node that always has n descendant samples is taken from the theoretical distribution of times for a node with n descendant tips averaged over all coalescent trees of s total tips (note that this assumes that all the samples are at time 0)

In most tree sequences, a node will not always have the same number of descendant samples in all regions of the genome. For such nodes, an approximate prior is constructed by averaging the mean and variance of the times, and then constructing a mixture prior based on a lognormal (default) or gamma distribution with the same mean and variance, weighted by the span of the node in each local tree.

It is unclear how well this approximation works in practice, as there are clear correlations between the span of a node and the number of children it has. Testing indicates that using a single prior for all nodes may provide better accuracy; this may also be a result of too strongly constraining each node prior to a fixed topology before dating. Currently the variational_gamma method using a single (“global”) prior for all nodes. The best prior to use for different methods is a current topic of investigation, and may be subject to change in future versions of tsdate.