Usage#

We’ll first generate a few “undated” tree sequences for later use:

Hide code cell source
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)  # don't display FutureWarnings for stdpopsim
import msprime
import numpy as np
import stdpopsim
import tsinfer
import tskit

# Use msprime to create a simulated tree sequence with mutations, for demonstration
n = 10
Ne = 100
mu = 1e-6
ts = msprime.sim_ancestry(n, population_size=Ne, sequence_length=1e6, random_seed=123)
ts = msprime.sim_mutations(ts, rate=mu, random_seed=123, discrete_genome=False)
# Remove time information
tables = ts.dump_tables()
tables.nodes.time = np.where(tables.nodes.flags & tskit.NODE_IS_SAMPLE, 0, np.arange(ts.num_nodes, dtype=float))
tables.mutations.time = np.full(ts.num_mutations, tskit.UNKNOWN_TIME)
tables.time_units = tskit.TIME_UNITS_UNCALIBRATED
sim_ts = tables.tree_sequence()

# Use stdpopsim to create simulated genomes for inference
species = stdpopsim.get_species("HomSap")
model = species.get_demographic_model("AmericanAdmixture_4B11")
contig = species.get_contig("chr1", left=1e7, right=1.1e7, mutation_rate=model.mutation_rate)
samples = {"AFR": 5, "EUR": 5, "ASIA": 5, "ADMIX": 5}
# Create DNA sequences, stored in the tsinfer SampleData format
stdpopsim_ts = stdpopsim.get_engine("msprime").simulate(model, contig, samples, seed=123)
sample_data = tsinfer.SampleData.from_tree_sequence(stdpopsim_ts)
inf_ts = tsinfer.infer(sample_data)

print(f"* Simulated `sim_ts` ({2*n} genomes from a popn of {Ne} diploids, mut_rate={mu} /bp/gen)")
print(f"* Inferred `inf_ts` using tsinfer ({stdpopsim_ts.num_samples} samples of human {contig.origin})")
* Simulated `sim_ts` (20 genomes from a popn of 100 diploids, mut_rate=1e-06 /bp/gen)
* Inferred `inf_ts` using tsinfer (40 samples of human chr1:10000000-11000000)

Quickstart#

Given a known genetic genealogy in the form of a tree sequence, tsdate simply re-estimates the node times based on the mutations on each edge. Usage is as simple as calling the date() function with an estimated effective population size, and estimated per base pair per generation mutation rate.

import tsdate
# Running `tsdate` is usually a single function call, as follows:
redated_ts = tsdate.date(sim_ts, population_size=100, mutation_rate=1e-6)

This simple example has no recombination, infinite sites mutation, a high mutation rate, and a known genealogy, so we would expect that the node times as estimated by tsdate from the mutations would be very close to the actual node times, as indeed they seem to be:

Hide code cell source
from matplotlib import pyplot as plt
import numpy as np
def plot_real_vs_tsdate_times(x, y, ts_x=None, ts_y=None, delta = 0.1, **kwargs):
    plt.scatter(x + delta, y + delta, **kwargs)
    plt.xscale('log')
    plt.xlabel(f'Real time' + ('' if ts_x is None else f' ({ts_x.time_units})'))
    plt.yscale('log')
    plt.ylabel(f'Estimated time from tsdate' + ('' if ts_y is None else f' ({ts_y.time_units})'))
    line_pts = np.logspace(np.log10(delta), np.log10(x.max()), 10)
    plt.plot(line_pts, line_pts, linestyle=":")

plot_real_vs_tsdate_times(ts.nodes_time, redated_ts.nodes_time, ts, redated_ts)
_images/3cd9be5656d0bf8b026c4235254868139defb7e59243325e6c3e737f6b52fb84.png

Note

See the Timescale adjustment section if you wish to use anything other than the default time_units="generations".

Inferred topologies#

A more typical use-case is where the genealogy has been inferred from DNA sequence data, for example by tsinfer or Relate. Below we will demonstrate with tsinfer output based on DNA sequences generated by a more realistic simulation.

With real data, especially from tsinfer you may want to preprocess the tree sequence before dating. This removes regions with no variable sites, and also simplifies to remove locally unary portions of nodes (see the Simplification and unary nodes section below for more details)

import tsdate
simplified_ts = tsdate.preprocess_ts(inf_ts)
dated_ts = tsdate.date(simplified_ts, population_size=10000, mutation_rate=model.mutation_rate)
print(
    f"Dated `inf_ts` (inferred from {inf_ts.num_sites} variants under the {model.id}",
    f"stdpopsim model, mutation rate = {model.mutation_rate} /bp/gen)"
)
Dated `inf_ts` (inferred from 3912 variants under the AmericanAdmixture_4B11 stdpopsim model, mutation rate = 2.36e-08 /bp/gen)

Note

There was not a fixed population size in the simulation used to generate the data, so we have used a rough commonly-used estimate of an human effective population size of 20,000 (see the Variable population sizes section for more sophisticated approaches).

The inference in this case is much more noisy (as illustrated using the original and inferred times of the node under each mutation):

Hide code cell source
# If there are multiple mutations at a site, we simply pick the first one
plot_real_vs_tsdate_times(
    stdpopsim_ts.nodes_time[[s.mutations[0].node for s in stdpopsim_ts.sites()]],
    dated_ts.nodes_time[[s.mutations[0].node for s in dated_ts.sites()]],
    delta=100,
    alpha=0.1,
)
_images/8216cd65c5c598e7e05a60feba4ea3662f717163bbb7cbb64add656666c5f396.png

Posterior time distributions#

The default output of tsdate is a new, dated tree sequence, created with node times changed to the mean time for each node. This is based on the means of the posterior time distributions for each node.

In the unusual occurrence of the mean time of a child node being older than the mean time of on of its parents, a small value, \(\epsilon\), is added to the parent time to ensure a valid tree sequence. The original mean times and their variances are stored in the node Metadata.

The full posterior distributions of times for each node can be accessed by specifying return_posteriors=True when calling tsdate.date(), which then returns both the dated tree sequence and a dictionary specifying the posterior distributions.

The returned posterior is a dictionary keyed by integer node ID, with values representing the probability distribution of times. This can be read in to a pandas dataframe:

import pandas as pd
redated_ts, posteriors = tsdate.date(
    sim_ts, population_size=100, mutation_rate=1e-6, method="inside_outside", return_posteriors=True)
posteriors_df = pd.DataFrame(posteriors)
posteriors_df.head()  # Show the dataframe
20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 time
0 0.000000 0.000000 0.000000 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.0 0.0 0.0 0.0 0.000000
1 0.100264 0.079272 0.000067 1.076798e-07 1.319789e-61 1.323093e-16 1.824140e-92 1.560503e-101 6.190844e-184 5.428899e-132 0.000000e+00 0.000000e+00 1.783687e-63 0.000000e+00 0.000000e+00 0.0 0.0 0.0 0.0 0.914053
2 0.177675 0.152645 0.001058 7.372226e-06 4.067750e-20 4.385752e-13 7.571748e-23 6.000761e-56 2.922087e-46 1.407553e-49 2.662055e-153 1.528618e-214 3.448580e-54 0.000000e+00 7.531673e-314 0.0 0.0 0.0 0.0 1.348584
3 0.201634 0.186804 0.005519 1.019018e-04 7.638410e-17 8.197991e-11 2.949506e-18 1.356753e-36 5.782621e-38 2.520780e-43 6.473164e-77 2.943222e-113 4.965404e-48 0.000000e+00 4.739415e-261 0.0 0.0 0.0 0.0 1.751783
4 0.182755 0.182299 0.017493 6.918350e-04 1.796166e-14 4.360474e-09 2.368280e-15 2.328367e-31 8.846950e-33 3.131357e-38 7.173261e-68 2.373294e-98 3.229263e-43 5.408274e-221 1.455132e-234 0.0 0.0 0.0 0.0 2.155214

Since we are using a Discrete-time method, each node (numbered column of the dataframe) is associated with a vector of probabilities that sum to one: each cell gives the probability that the time of the node whose ID is given by the column header lies at the specific timepoint given by the time column.

For the continuous-time variational_gamma method, the posterior for each node is represented by the shape and rate parameter of the gamma approximation, as described by the parameter column.

Population sizes#

The population_size can either be a single number, specifying the “effective population size”, or a piecewise constant function of time, specifying a set of fixed population sizes over a number of contiguous time intervals. Functions of this sort are captured by the PopulationSizeHistory class: see the Variable population sizes page for its use and interpretation.

If you don’t have an established estimate for the effective population size of your data, a rough approximation is to use the (sitewise) genetic diversity divided by four-times the mutation rate:

print("A rough estimate of the effective population size is", sim_ts.diversity() / (4 * 1e-6))
A rough estimate of the effective population size is 61.39868421052534

Timescale adjustment#

There is one gotcha involving the population size and the assumed time units used for dating, because theoretically, the effective population size determines the coalescent timescale. If you are using a time scale other than “generations”, as well as setting the “time_scale” parameter (e.g. to “years”), and quoting parameters such as mutation_rate in terms of a per-bp per-year rate, you will also need to modify the effective population size passed to tsdate by multiplying it by the generation time. For example:

import numpy as np
popsize = 100  # Diploid population size
mutation_rate_per_gen = 1e-8
# By default, dates are in generations
ts_generations = tsdate.date(ts, mutation_rate_per_gen, popsize)

# To infer dates in years, adjust both the rates and the population size:
generation_time = 30  # Years
mutation_rate_per_year = mutation_rate_per_gen / generation_time
ts_years = tsdate.date(
    ts, mutation_rate_per_year, popsize * generation_time, time_units="years")

# Check that the inferred node times are identical, just on different scales
assert np.allclose(ts_generations.nodes_time, ts_years.nodes_time / generation_time, 5)

Real data#

Real world data is likely to consist of larger datasets than in the example, and may exhibit issues that are not present in simulated data and can e.g. cause numerical instability and other problems. Here we detail some common issues found in real data.

Memory and run time#

tsdate can be run on most modern computers: large tree sequences of millions or tens of millions of edges will take of the order of hours, and use tens of GB of RAM (e.g. 24 GB / 1 hour on a 2022-era laptop for a tree sequence of 5 million edges covering 60 megabases of 7500 samples of human chromosome 6 from Wohns et al. [2022]).

Todo

Add some scaling plots. Some real-world examples: a dataset of 10K samples of half a million sites (~4M edges) on one node of a 2023 Intel Platinum cluster takes ~30 mins (20GB max memory) for the inside-outside method and ~10 mins (1.5GB max memory) using the variational_gamma method.

Running the dating algorithm is linear in the number of edges in the tree sequence. This makes tsdate usable even for vary large tree sequences (e.g. millions of samples). For large instances, if you are running tsdate interactively, it can be useful to specify the progress option to display a progress bar telling you how long different stages of dating will take.

If the method used for dating involves discrete time slices, tsdate scales quadratically in the number of time slices chosen. For greater temporal resolution, you are thus advised to use the variational_gamma method, which does not discretise time.

Optimisations#

Before the dating algorithm is run, the conditional coalescent prior distribution must be calculated for each node. Although this is roughly linear in the number of nodes, it can still take some time if there are millions of nodes. To speed up this process an approximation to the conditional coalescent is used for nodes which have a large number of descendants (the exact number can be modified by making a prior using the approx_prior_size parameter). The results of this approximation are cached, so that tsdate will run slightly faster the second time it is run.

The time taken to date a tree sequence using tsdate is only a fraction of that requires to infer the initial tree sequence, therefore the core tsdate algorithm has not been parallelised to allow running on many CPU cores. However, there is some precalculation of regularly reused mutational likelihoods which can be parallelised easily: this step can be sped up by specifying the num_threads parameter to date() (however, this behaviour is subject to change in future versions).

CLI use#

Computationally-intensive uses of tsdate are likely to involve running the program non-interactively, e.g. as part of an automated pipeline. In this case, it may be useful to use the command-line interface. See Command line interface for more details.

Numerical stability and preprocessing#

Numerical stability issues are usually caused by “bad” tree sequences (i.e. pathological combinations of topologies and mutations). These can be caused, for example, by long deep branches with very few mutations, such as samples attaching directly to a local root. These can often be fixed by removing bad regions of the tree sequence, e.g. regions that have no variation because they are unmappable or have been removed by a QC filter, such as at the centromere. Numerical instability will manifest itself by raising an error when dating.

The tsdate.preprocess_ts() function can help remove topology from these regions. See the documentation for that function for details on how to increase or decrease its stringency.

Simplification and unary nodes#

Above we have simplified the tree sequence to remove “locally unary” nodes (i.e. nodes that, in some trees, have only one child). The tsinfer algorithm produces these by default, but often overestimates the span of genome that they cover, which causes numerical instability and inaccuracy.

However, we expect accurately estimated unary regions to improve dating. It is therefore possible to date a tree sequence containing locally unary nodes using the allow_unary option when building a prior.