Usage#

We’ll first generate a few example “undated” tree sequences:

import msprime
import numpy as np
import stdpopsim
import tsinfer
import tskit

# Use msprime to create a simulated tree sequence with mutations, for demonstration
n = 10
Ne = 100
mu = 1e-6
ts = msprime.sim_ancestry(n, population_size=Ne, sequence_length=1e6, random_seed=123)
ts = msprime.sim_mutations(ts, rate=mu, random_seed=123, discrete_genome=False)
# Remove time information
tables = ts.dump_tables()
tables.nodes.time = np.where(tables.nodes.flags & tskit.NODE_IS_SAMPLE, 0, np.arange(ts.num_nodes, dtype=float))
tables.mutations.time = np.full(ts.num_mutations, tskit.UNKNOWN_TIME)
tables.time_units = tskit.TIME_UNITS_UNCALIBRATED
sim_ts = tables.tree_sequence()

# Use stdpopsim to create simulated genomes for inference
species = stdpopsim.get_species("HomSap")
model = species.get_demographic_model("AmericanAdmixture_4B18")
contig = species.get_contig("chr1", left=1e7, right=1.1e7, mutation_rate=model.mutation_rate)
samples = {"AFR": 5, "EUR": 5, "ASIA": 5, "ADMIX": 5}
# Create DNA sequences, stored in the tsinfer SampleData format
stdpopsim_ts = stdpopsim.get_engine("msprime").simulate(model, contig, samples, seed=123)
sample_data = tsinfer.SampleData.from_tree_sequence(stdpopsim_ts)
inf_ts = tsinfer.infer(sample_data)

print(f"* Simulated `sim_ts` ({2*n} genomes from a popn of {Ne} diploids, mut_rate={mu} /bp/gen)")
print(f"* Inferred `inf_ts` using tsinfer ({stdpopsim_ts.num_samples} samples of human {contig.origin})")
* Simulated `sim_ts` (20 genomes from a popn of 100 diploids, mut_rate=1e-06 /bp/gen)
* Inferred `inf_ts` using tsinfer (40 samples of human chr1:10000000-11000000)

Quickstart#

Given a known genetic genealogy in the form of a tree sequence, tsdate simply re-estimates the node times based on the mutations on each edge. Usage is as simple as calling the date() function with an estimated per base pair per generation mutation rate.

import tsdate
# Running `tsdate` is usually a single function call, as follows:
mu_per_bp_per_generation = 1e-6
redated_ts = tsdate.date(sim_ts, mutation_rate=mu_per_bp_per_generation)
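
As a quick check, the output of date() should now be calibrated: by default the redated tree sequence uses a timescale of generations (see the note below), whereas the input was marked as uncalibrated:

print(sim_ts.time_units, "->", redated_ts.time_units)  # expected: uncalibrated -> generations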

This simple example has no recombination, uses infinite-sites mutations, has a high mutation rate, and comes with a known genealogy, so we would expect the node times estimated by tsdate from the mutations to be very close to the true node times, as indeed they seem to be:

from matplotlib import pyplot as plt
import scipy.stats as stats

import numpy as np
def plot_real_vs_tsdate_times(
    x, y, ax=None, ts_x=None, ts_y=None, y_variance=None, delta=0.1, **kwargs
):
    if ax is None:
        ax=plt.gca()
    x, y = np.array(x), np.array(y)
    line_pts = np.logspace(np.log10(delta), np.log10(x.max()), 10)
    ax.plot(line_pts, line_pts, linestyle=":", c="black")
    if y_variance is None:
        ax.scatter(x + delta, y + delta, **kwargs)
    else:
        # variance is of a gamma distribution. The sample nodes have zero mean, so ignore those divisions
        with np.errstate(invalid='ignore'):
            scale = y_variance / y
            shape = y / scale
            lower = y - stats.gamma.ppf(0.025, shape, scale=scale)
            upper = stats.gamma.ppf(0.975, shape, scale=scale) - y
            err = np.array([lower, upper])
            ax.errorbar(x + delta, y + delta, err + delta, fmt='_', ecolor='lightgrey', mec='tab:blue')
    ax.set_xscale('log')
    ax.set_xlabel(f'Real time' + ('' if ts_x is None else f' ({ts_x.time_units})'))
    ax.set_yscale('log')
    ax.set_ylabel(f'Estimated time from tsdate' + ('' if ts_y is None else f' ({ts_y.time_units})'))

fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharex=True, sharey=True)
plot_real_vs_tsdate_times(ts.nodes_time, redated_ts.nodes_time, axes[0], ts, redated_ts)
mean, var = np.array([[n.metadata["mn"], n.metadata["vr"]] for n in redated_ts.nodes()]).T
plot_real_vs_tsdate_times(ts.nodes_time, mean, axes[1], ts, redated_ts, y_variance=var)
axes[1].set_ylabel("");
[Figure: true node times versus tsdate estimates, without (left) and with (right) 95% confidence intervals]

The left-hand plot shows the redated tree sequence node times. The right-hand plot is essentially identical, but adds 95% confidence intervals (technically, it shows the unconstrained posterior node times, with intervals based on the gamma distribution fitted by the default tsdate method).

Note

By default time is assumed to be measured in “generations”, but this can be changed by using the time_units parameter: see the Timescale adjustment section.

Inferred topologies#

A more typical use-case is where the genealogy has been inferred from DNA sequence data, for example by tsinfer or Relate. Below we will demonstrate with tsinfer output based on DNA sequences generated by a more realistic simulation.

With real data, especially from tsinfer, you may want to preprocess the tree sequence before dating. Preprocessing removes regions with no variable sites, simplifies the tree sequence to remove locally unary portions of nodes, and splits disjoint nodes into separate datable nodes (see the Preprocessing section for more details).

import tsdate
simplified_ts = tsdate.preprocess_ts(
    inf_ts,
    remove_telomeres=False  # Simulated example, so no flanking regions / telomeres exist
)
dated_ts = tsdate.date(simplified_ts, mutation_rate=model.mutation_rate)
print(
    f"Dated `inf_ts` (inferred from {inf_ts.num_sites} variants under the {model.id}",
    f"stdpopsim model, mutation rate = {model.mutation_rate} /bp/gen)"
)
Dated `inf_ts` (inferred from 3861 variants under the AmericanAdmixture_4B18 stdpopsim model, mutation rate = 2.36e-08 /bp/gen)

Note

With simulated data there are usually no missing data regions, so you may be able to pass erase_flanks=False to the preprocess_ts function.
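
For instance, the preprocessing step could also be written using the parameter named in this note (a sketch only; check the preprocess_ts documentation for the exact argument name accepted by your installed version):

simplified_ts = tsdate.preprocess_ts(inf_ts, erase_flanks=False)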

The inference in this case is much noisier, as illustrated below using the original and inferred times of the node under each mutation:

# If there are multiple mutations at a site, we simply pick the first one

plot_real_vs_tsdate_times(
    [s.mutations[0].time for s in stdpopsim_ts.sites()],
    [s.mutations[0].metadata['mn'] for s in dated_ts.sites()],
    delta=100,
    alpha=0.1,
)
[Figure: original versus inferred times of the node under each mutation]

Note that if a bifurcating topology in the original genealogies has been collapsed into a polytomy (multifurcation) in the inferred version, then the default node times output by tsdate correspond to the time of the oldest bifurcating node. If you wish to consider diversity or coalescence over time, you should therefore consider explicitly setting the match_segregating_sites parameter to True (see the Rescaling details section).
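
For example, this parameter can be passed directly when dating (an illustrative call; see the Rescaling details section for when this is appropriate):

rescaled_ts = tsdate.date(
    simplified_ts, mutation_rate=model.mutation_rate, match_segregating_sites=True
)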

Results and posteriors#

The default output of tsdate is a new, dated tree sequence, created with node times set to the (constrained) posterior mean time for each node, and mutation times set halfway between the time of the node above and the time of the node below the mutation.

However, these tree-sequence time arrays do not capture all the information about the distribution of times. They contain no information about the variance of the time estimates, and where necessary the mean times are constrained to produce a valid tree sequence (i.e. if the mean time of a child node is older than the mean time of one of its parents, a small value, \(\epsilon\), is added to the parent time).
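
For instance, the following sketch verifies the ordering that the \(\epsilon\) adjustment guarantees in the returned tree sequence:

# Every parent node in the dated output is strictly older than each of its children,
# even if the unconstrained posterior mean times would have violated this ordering
for edge in redated_ts.edges():
    assert redated_ts.nodes_time[edge.parent] > redated_ts.nodes_time[edge.child]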

For this reason, there are two ways to get variances and unconstrained dates when running tsdate:

  1. The nodes and mutations in the tree sequence will usually contain Metadata specifying the mean time and its variance. These metadata values (currently saved as mn and vr) are not constrained by the topology of the tree sequence, and should be used in preference to e.g. nodes_time and mutations_time when evaluating the accuracy of tsdate (a short example of reading them is shown after the dataframe below).

  2. The return_fit parameter can be used when calling tsdate.date(), which then returns both the dated tree sequence and a fit object. This object can then be queried for the unconstrained posterior distributions, e.g. using .node_posteriors(), which can be read into a pandas dataframe, as below:

import pandas as pd
redated_ts, fit = tsdate.date(
    sim_ts, mutation_rate=1e-6, return_fit=True)
posteriors_df = pd.DataFrame(fit.node_posteriors())  # mutation_posteriors() also available
posteriors_df.tail()  # Show the dataframe
          mean   variance
34   77.573465  26.359205
35  105.464187  30.020537
36  141.468529  52.913873
37  215.321172  56.009024
38  263.365664  69.361473
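
As a cross-check, the same unconstrained means and variances can be read directly from the node metadata described in point 1 (a minimal sketch; sample nodes are skipped as their time is fixed at zero):

unconstrained = {
    node.id: (node.metadata["mn"], node.metadata["vr"])
    for node in redated_ts.nodes()
    if not node.is_sample()
}
print(list(unconstrained.items())[-3:])  # the oldest few nodes, corresponding to the rows above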

Timescale adjustment#

The default tsdate timescale is “generations”. Changing this can be as simple as providing a time_units argument:

mu_per_bp_per_gen = 1e-8  # per generation
ts_generations = tsdate.date(ts, mutation_rate=mu_per_bp_per_gen)

mu_per_bp_per_year = 3.4e-10  # Human generation time ~ 29 years
ts_years = tsdate.date(ts, mutation_rate=mu_per_bp_per_year, time_units="years")

However, if you are specifying a node-specific prior, e.g. because you are using a discrete-time method, you will also need to change the scale of the prior. In particular, if you set the prior using the population_size argument, that value must also be multiplied by the generation time. For example:

Ne = 100  # Diploid population size
mu_per_bp_per_gen = 1e-8  # per generation
ts_generations = tsdate.inside_outside(ts, mutation_rate=mu_per_bp_per_gen, population_size=Ne)

# To infer dates in years, adjust both the rates and the population size:
generation_time = 29  # Years
mu_per_bp_per_year = mu_per_bp_per_gen / generation_time
ts_years = tsdate.inside_outside(
    ts,
    mutation_rate=mu_per_bp_per_year,
    population_size=Ne * generation_time,
    time_units="years"
)

# Check that the inferred node times are identical, just on different scales
assert np.allclose(ts_generations.nodes_time, ts_years.nodes_time / generation_time)

Memory and run time#

Tsdate can be run on most modern computers. Using the default variational_gamma method, large tree sequences of millions or tens of millions of edges take tens of minutes and gigabytes of RAM (e.g. 10 GB / 50 mins on a 2024-era Apple M2 laptop for a tree sequence of 65 million edges covering 81 megabases of 2.85 million samples of human chromosome 17 from Anderson-Trocmé et al. [2023]).


The run time of the dating algorithm is linear in the number of edges in the tree sequence. This makes tsdate usable even for very large tree sequences (e.g. millions of samples). For large instances, if you are running tsdate interactively, it can be useful to specify the progress option to display a progress bar showing how long the different stages of dating will take.
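
For example (a minimal sketch using the progress option mentioned above):

dated_ts = tsdate.date(simplified_ts, mutation_rate=model.mutation_rate, progress=True)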

Because the time taken to date a tree sequence with tsdate is only a fraction of that required to infer the initial tree sequence, the core tsdate algorithm has not been parallelised to run across many CPU cores.

Note

Extensive use of just-in-time compilation via numba means that it can take tens of seconds to load the tsdate module into Python. See here for a workaround if this is causing you problems.

CLI use#

Computationally-intensive uses of tsdate are likely to involve running the program non-interactively, e.g. as part of an automated pipeline. In this case, it may be useful to use the command-line interface. See Command line interface for more details.