Usage#
We’ll first generate a few “undated” tree sequences for later use:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning) # don't display FutureWarnings for stdpopsim
import msprime
import numpy as np
import stdpopsim
import tsinfer
import tskit
# Use msprime to create a simulated tree sequence with mutations, for demonstration
n = 10
Ne = 100
mu = 1e-6
ts = msprime.sim_ancestry(n, population_size=Ne, sequence_length=1e6, random_seed=123)
ts = msprime.sim_mutations(ts, rate=mu, random_seed=123, discrete_genome=False)
# Remove time information
tables = ts.dump_tables()
tables.nodes.time = np.where(tables.nodes.flags & tskit.NODE_IS_SAMPLE, 0, np.arange(ts.num_nodes, dtype=float))
tables.mutations.time = np.full(ts.num_mutations, tskit.UNKNOWN_TIME)
tables.time_units = tskit.TIME_UNITS_UNCALIBRATED
sim_ts = tables.tree_sequence()
# Use stdpopsim to create simulated genomes for inference
species = stdpopsim.get_species("HomSap")
model = species.get_demographic_model("AmericanAdmixture_4B11")
contig = species.get_contig("chr1", left=1e7, right=1.1e7, mutation_rate=model.mutation_rate)
samples = {"AFR": 5, "EUR": 5, "ASIA": 5, "ADMIX": 5}
# Create DNA sequences, stored in the tsinfer SampleData format
stdpopsim_ts = stdpopsim.get_engine("msprime").simulate(model, contig, samples, seed=123)
sample_data = tsinfer.SampleData.from_tree_sequence(stdpopsim_ts)
inf_ts = tsinfer.infer(sample_data)
if inf_ts.table_metadata_schemas.node.schema is None:
    # Only needed because of a tsinfer bug that doesn't set schemas: remove when fixed
    tables = inf_ts.dump_tables()
    tables.nodes.metadata_schema = tskit.MetadataSchema.permissive_json()
    inf_ts = tables.tree_sequence()
print(f"* Simulated `sim_ts` ({2*n} genomes from a popn of {Ne} diploids, mut_rate={mu} /bp/gen)")
print(f"* Inferred `inf_ts` using tsinfer ({stdpopsim_ts.num_samples} samples of human {contig.origin})")
* Simulated `sim_ts` (20 genomes from a popn of 100 diploids, mut_rate=1e-06 /bp/gen)
* Inferred `inf_ts` using tsinfer (40 samples of human chr1:10000000-11000000)
Quickstart#
Given a known genetic genealogy in the form of a tree sequence, tsdate simply
re-estimates the node times based on the mutations on each edge. Usage is as
simple as calling the date() function with an estimated per-base-pair mutation rate.
import tsdate
# Running `tsdate` is usually a single function call, as follows:
mu_per_bp_per_generation = 1e-6
redated_ts = tsdate.date(sim_ts, mutation_rate=mu_per_bp_per_generation)
This simple example has no recombination, infinite-sites mutations, a high mutation rate, and a known genealogy, so we would expect the node times estimated by tsdate from the mutations to be very close to the actual node times, as indeed they seem to be:
from matplotlib import pyplot as plt
import numpy as np

def plot_real_vs_tsdate_times(x, y, ts_x=None, ts_y=None, delta=0.1, **kwargs):
    x, y = np.array(x), np.array(y)
    plt.scatter(x + delta, y + delta, **kwargs)
    plt.xscale('log')
    plt.xlabel('Real time' + ('' if ts_x is None else f' ({ts_x.time_units})'))
    plt.yscale('log')
    plt.ylabel('Estimated time from tsdate' + ('' if ts_y is None else f' ({ts_y.time_units})'))
    line_pts = np.logspace(np.log10(delta), np.log10(x.max()), 10)
    plt.plot(line_pts, line_pts, linestyle=":")
plot_real_vs_tsdate_times(ts.nodes_time, redated_ts.nodes_time, ts, redated_ts)
Note
By default, time is assumed to be measured in “generations”, but this can be changed
using the time_units parameter. If you are using the alternative discrete-time methods,
you will also need to adjust the prior, as described in the
Timescale adjustment section.
Inferred topologies#
A more typical use-case is where the genealogy has been inferred from DNA sequence data,
for example by tsinfer or
Relate. Below we will demonstrate with tsinfer
output based on DNA sequences generated by a more realistic simulation.
With real data, especially from tsinfer, you may want to preprocess
the tree sequence before dating. This removes regions with no variable sites,
simplifies the tree sequence to remove locally unary portions of nodes, and splits
disjoint nodes into separate datable nodes (see the
Numerical stability and preprocessing section below for more details).
import tsdate
simplified_ts = tsdate.preprocess_ts(inf_ts)
dated_ts = tsdate.date(simplified_ts, mutation_rate=model.mutation_rate)
print(
    f"Dated `inf_ts` (inferred from {inf_ts.num_sites} variants under the {model.id}",
    f"stdpopsim model, mutation rate = {model.mutation_rate} /bp/gen)"
)
Dated `inf_ts` (inferred from 3912 variants under the AmericanAdmixture_4B11 stdpopsim model, mutation rate = 2.36e-08 /bp/gen)
Note
Simulated data may not contain regions of missing data, in which case you may be able
to pass remove_telomeres=False to the preprocess_ts function.
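For example, a minimal hedged sketch, reusing the inf_ts object defined above (check your tsdate version for the exact argument name and default behaviour):
# Hedged sketch: skip telomere removal for simulated input, as described in the note above
simulated_input_ts = tsdate.preprocess_ts(inf_ts, remove_telomeres=False)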
The inference in this case is much noisier (as illustrated by plotting the original and inferred times of the node under each mutation):
# If there are multiple mutations at a site, we simply pick the first one
plot_real_vs_tsdate_times(
    [s.mutations[0].time for s in stdpopsim_ts.sites()],
    [s.mutations[0].metadata['mn'] for s in dated_ts.sites()],
    delta=100,
    alpha=0.1,
)
Results and posteriors#
The default output of tsdate is a new, dated tree sequence, in which each node time is set to its posterior mean and each mutation time is set halfway between the times of the nodes above and below it.
However, these tree-sequence time arrays do not capture all the information about the distribution of times: they contain no information about the variance of the time estimates, and even the mean times must be constrained to produce a valid tree sequence. For example, a mutation time must lie between the times of the nodes above and below it, and if the mean time of a child node is older than that of one of its parents, a small value, \(\epsilon\), is added to the parent time to keep the tree sequence valid.
For this reason, there are two ways to get additional, unconstrained dates when running tsdate (both are sketched in the code below):
1. The nodes and mutations in the tree sequence will usually contain metadata specifying the unconstrained mean time and its variance. These metadata values (currently saved as mn and vr) need not be constrained by the topology of the tree sequence, and should be used in preference to e.g. nodes_time and mutations_time when evaluating the accuracy of tsdate.
2. The return_posteriors parameter can be used when calling tsdate.date(), which then returns both the dated tree sequence and a dictionary specifying the posterior distributions.
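As a hedged illustration of both routes, reusing the dated_ts and simplified_ts objects from above (the metadata keys and the return value of return_posteriors are as described in the list, but may vary between tsdate versions):
# 1. Unconstrained mean ("mn") and variance ("vr") stored in metadata, here read
#    from the first mutation at the first site (as in the plotting code earlier):
first_site = next(dated_ts.sites())
print(first_site.mutations[0].metadata)  # e.g. {'mn': ..., 'vr': ...}

# 2. Ask date() for the posterior distributions as well as the dated tree sequence:
dated_ts2, posteriors = tsdate.date(
    simplified_ts, mutation_rate=model.mutation_rate, return_posteriors=True
)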
Timescale adjustment#
The default tsdate timescale is “generations”. Changing this can be as simple as:
mu_per_bp_per_gen = 1e-8 # per generation
ts_generations = tsdate.date(ts, mutation_rate=mu_per_bp_per_gen)
mu_per_bp_per_year = 3.4e-10 # Human generation time ~ 29 years
ts_years = tsdate.date(ts, mutation_rate=mu_per_bp_per_year, time_units="years")
However, if you are specifying a node-specific prior, e.g. because you are using a
discrete-time method, you will also need to change the scale of the prior. In particular,
if you are setting the prior via the population_size argument, you must also
multiply that by the generation time. For example:
Ne = 100 # Diploid population size
mu_per_bp_per_gen = 1e-8 # per generation
ts_generations = tsdate.inside_outside(ts, mutation_rate=mu_per_bp_per_gen, population_size=Ne)
# To infer dates in years, adjust both the rates and the population size:
generation_time = 29 # Years
mu_per_bp_per_year = mu_per_bp_per_gen / generation_time
ts_years = tsdate.inside_outside(
ts,
mutation_rate=mu_per_bp_per_year,
population_size=Ne * generation_time,
time_units="years"
)
# Check that the inferred node times are identical, just on different scales
assert np.allclose(ts_generations.nodes_time, ts_years.nodes_time / generation_time)
Real data#
Real-world datasets are likely to be larger than those in the examples above, and
may exhibit issues that are absent from simulated data, such as numerical
instability. In addition to preprocessing,
below we detail some other solutions to problems that are common in real datasets.
Memory and run time#
Tsdate can be run on most modern computers: large tree sequences of millions or tens of millions of edges will take on the order of hours to date, and use tens of GB of RAM (e.g. 24 GB / 1 hour on a 2022-era laptop for a tree sequence of 5 million edges covering 60 megabases of human chromosome 6 for 7500 samples, from Wohns et al. [2022]).
Todo
Add some scaling plots. Some real-world examples: a dataset of 10K samples of half a
million sites (~4M edges) on one node of a
2023 Intel Platinum cluster takes ~30 mins (20GB max memory) for the inside-outside
method and ~10 mins (1.5GB max memory) using the variational_gamma
method.
Running the dating algorithm is linear in the number of edges in the tree sequence.
This makes tsdate usable even for very large tree sequences (e.g. millions of samples).
For large instances, if you are running tsdate interactively, it can be useful to
specify the progress
option to display a progress bar telling you how long
different stages of dating will take.
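For example, a minimal sketch using the progress option mentioned above (other arguments reuse objects defined earlier on this page):
# Display progress bars for the different stages of dating (useful when run interactively)
dated_with_progress = tsdate.date(
    simplified_ts, mutation_rate=model.mutation_rate, progress=True
)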
The time taken to date a tree sequence using tsdate is only a fraction of that required to infer the initial tree sequence, so the core tsdate algorithm has not been parallelised to run on many CPU cores.
Continuous time optimisations#
The expectation propagation approach
in the variational_gamma
method involves iteratively refining the
local time estimates. The number of rounds of iteration can be
set via the max_iterations
parameter. Reducing this will speed up tsdate
inference, but may produce worse date estimates. Future updates to tsdate may
optimise this so that iterations are halted after an appropriate convergence criterion
has been reached.
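For instance, a hedged sketch, assuming max_iterations is accepted by the variational_gamma entry point in the same way as the method-specific functions used elsewhere on this page:
# Fewer expectation propagation rounds: faster, but possibly worse date estimates
quick_ts = tsdate.variational_gamma(
    simplified_ts, mutation_rate=model.mutation_rate, max_iterations=5
)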
Discrete time optimisations#
If the method used for dating involves discrete time slices, tsdate scales
quadratically in the number of time slices chosen. For greater temporal resolution,
you are thus advised to stick with the variational_gamma
method, which does not discretise time.
Some precalculation of regularly reused mutational likelihoods can be parallelised
easily: this step can be sped up by specifying the num_threads
parameter to
date()
(however, this behaviour is subject to change in future versions).
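A hedged one-liner, assuming date() currently accepts num_threads as described above (and noting that this behaviour is subject to change):
# Parallelise the precalculation of mutational likelihoods across 4 threads
dated_multithread = tsdate.date(
    simplified_ts, mutation_rate=model.mutation_rate, num_threads=4
)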
For discrete-time methods, before the dating algorithm is run the conditional
coalescent prior distribution must be calculated for each node.
Although this is roughly linear in the number of nodes,
it can still take some time if there are millions of nodes.
To speed up this process an approximation to the conditional coalescent
is used for nodes which have a large number of descendants
(the exact number can be modified by making a prior
using the approx_prior_size
parameter).
The results of this approximation are cached, so that tsdate will run slightly faster
the second time it is run.
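As a hedged sketch of this discrete-time workflow, a prior could be pre-built and then passed to the dating step; the build_prior_grid function name, the approximate_priors flag, the example population size, and the priors argument are assumptions based on the tsdate API and may differ between versions:
# Build a conditional coalescent prior, approximating it for nodes with many
# descendants (approx_prior_size controls when the approximation is used)
priors = tsdate.build_prior_grid(
    simplified_ts,
    population_size=10_000,   # assumed effective population size for illustration
    approximate_priors=True,  # assumed flag enabling the approximation
    approx_prior_size=1000,
)
dated_discrete = tsdate.inside_outside(
    simplified_ts, mutation_rate=model.mutation_rate, priors=priors
)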
CLI use#
Computationally-intensive uses of tsdate are likely to involve running the program non-interactively, e.g. as part of an automated pipeline. In this case, it may be useful to use the command-line interface. See Command line interface for more details.
Numerical stability and preprocessing#
Numerical stability issues will usually manifest as an error raised during dating. They are typically caused by “bad” tree sequences (i.e. pathological combinations of topologies and mutations), for example long, deep branches with very few mutations, such as samples attaching directly to a local root. These can often be fixed by removing bad regions of the tree sequence, e.g. regions that have no variation because they are unmappable or have been removed by a QC filter, such as at the centromere.
In addition, the tsinfer
algorithm can overestimate the span of genome
covered by some nodes, especially those at ancient times. This is particularly
common when nodes contain “locally unary” sections (i.e. where the node has only
one child in some trees), or where a node is “disjoint”, disappearing from one tree
only to reappear in a tree in a more distant region of the genome. This extended node
span can be another source of numerical instability and inaccuracy. Locally unary
sections of nodes can be removed by simplification, and disjoint nodes can be split
using tsdate.util.split_disjoint_nodes().
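A hedged sketch of applying these individual steps by hand (for illustration only; the preprocess_ts function described next bundles them together, and argument defaults may differ between tsdate versions):
# Split disjoint nodes into separately datable nodes, then simplify to remove
# locally unary sections of nodes
split_ts = tsdate.util.split_disjoint_nodes(inf_ts)
manually_cleaned_ts = split_ts.simplify()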
The tsdate.preprocess_ts()
function carries out all of these procedures in
a single call: it removes data-poor regions, simplifies, and splits
disjoint nodes. See the documentation for that function for details on how to
increase or decrease its stringency.
Note
If unary regions are correctly estimated, they can help improve dating slightly.
There is therefore a specific route to date a tree sequence containing locally unary
nodes. For example, for the discrete-time methods, you can use the allow_unary option
when building a prior.
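For example, a hedged sketch, assuming allow_unary is accepted when building a prior grid for the discrete-time methods (the function name, argument names, and population size here are assumptions and may differ between tsdate versions):
# Build a prior that tolerates locally unary nodes, for use with discrete-time methods
priors_unary = tsdate.build_prior_grid(
    inf_ts, population_size=10_000, allow_unary=True  # population size assumed
)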