Real data#
Real world data is likely to consist of larger datasets than in the Usage examples, and may exhibit issues that are not present in simulated data such as numerical instability and other problems.
In particular, two issues can have a large influence when dating tree sequences inferred by tools such as tsinfer:
An inferred tree sequence may have too many edge constraints, for instance if multiple ancestors in the true genealogy are wrongly combined into a single node in the inferred tree sequence, or if a parent-child relationship (i.e. an edge) is wrongly inferred between two nodes.
Inferred tree sequences may contain polytomies (multifurcations) indicating uncertainty about the order of branching in the original genealogy. In this case, it is possible to choose the polytomy date to represent either the oldest of the contributing branch points, or the expected pairwise tMRCA between the lineages contributing to the polytomy. The first provides a more accurate estimate of both mutation dates and the date of nodes under each mutation. The second provides a better estimate of coalescence times and edge areas / total branch lengths.
Issue 1 can be ameliorated by fully simplifying the tree sequence, and breaking up
nodes (e.g. by splitting nodes into separate ancestors: see
Preprocessing). Using the default variational_gamma
method,
issue 2 can be addressed during the rescaling step by explicitly setting match_segregating_sites
to either True
or False
(see Rescaling details below)
Preprocessing#
For real datasets, you are advised to run preprocess_ts()
on your inferred tree
sequence before dating. This removes regions with no variable sites, simplifies the
tree sequence to remove locally unary portions of nodes, and splits disjoint nodes into
separate datable nodes. Not only can this improve the accuracy of dating, but it
also increases Numerical stability. See the documentation for
tsdate.preprocess_ts()
for details on how to increase or decrease its stringency.
In particular, the tsinfer
algorithm can overestimate the span of genome
covered by some nodes, especially those at ancient times. This is particularly
seen when nodes contain “locally unary” sections (i.e. where the node only has
one child in some trees), or where a node is “disjoint”, disappearing from one tree
to reappear in a tree in a more distant region of the genome. This extended node
span can cause problems by tying together distant parts of the genealogy. Preprocessing
removes locally unary sections of nodes via simplification, and splits disjoint nodes
using the tsdate.util.split_disjoint_nodes()
functionality.
Note
If unary regions are correctly estimated, they can help improve dating slightly.
You can set the allow_unary=True
option to run tsdate on such tree sequences.
Continuous time considerations#
In general, when analysing real data with tsdate, we recommended using
the default tsdate.variational_gamma()
method. This
has several parameters which can be adjusted. The max_iterations
parameter simply
adjusts the number of rounds of Expectation propagation, and
is usually sufficiently large to give reasonable results on real data. The
other parameters concern rescaling.
Rescaling details#
Tsdate
uses a modified version of the rescaling algorithm introduced in
SINGER [Deng et al., 2024]
that works on gamma natural parameters rather than point estimates.
Rescaling can help to deal with the effects of variable population size
though time (see Population size). Currently, three parameters can be set:
match_segregating_sites
. This has the largest effect if tree sequences contain polytomies.When explicitly set to
True
, times within intervals are scaled such that the average mutational distance between pairs of samples best matches the mutation rate. This is similar to the approach in SINGER, and is appropriate if, for example, you are interested in estimating coalescence rates over time, or average divergent times (MRCAs) between samples. In the resulting dated tree sequence, polytomy dates will correspond to the average distance between the pairs of lineages contributing to the polytomy.By default,
match_segregating_sites=False
, meaning that times within intervals are simultaneously scaled such that the expected density of mutations along each path from a sample to the root best matches the mutational density predicted from the user-provided mutation rate. This is most suitable if you wish to estimate dates of mutations or dates of identifiable common ancestors. In the resulting dated tree sequence, polytomy dates will correspond to the oldest ancestor that contributes to the polytomy.
rescaling_intervals
. This sets the number of time intervals over which rescaling takes place. To turn off rescaling entirely, a value of0
can be provided; however, resulting dates may be less accurately estimated if the dataset comes from a set of samples with a complex demographic history (see Population size)rescaling_iterations
: several rounds of rescaling are carried out. Normally very few iterations are required to reach a stable set of rescaled intervals.
Todo
Describe the rescaling step in more detail. Could also link to the population size docs
Discrete time considerations#
A few parameters can be set to speed up discrete-time methods.
Note
For Discrete-time methods, tsdate scales
quadratically in the number of time slices. To increase speed or temporal resolution,
you are thus advised to keep with the Continuous-time
variational_gamma
method.
For discrete-time methods, before the dating algorithm is run the conditional
coalescent prior distribution must be calculated for each node.
Although this is roughly linear in the number of nodes,
it can still take some time if there are millions of nodes.
To speed up this process an approximation to the conditional coalescent
is used for nodes which have a large number of descendants
(the exact number can be modified by making a prior
using the approx_prior_size
parameter).
The results of this approximation are cached, so that tsdate will run slightly faster
the second time it is run.
Regularly reused mutational likelihoods are precalculated in the dicrete-time methods.
This precalculation can be parallelised by specifying the num_threads
parameter to
tsdate.inside_outside()
and tsdate.maximization()
. However, this
behaviour is subject to change in future versions.
Numerical stability#
When passing messages between nodes, it is possible that the node time updates are
wildly incompatible with each other, for instance if a focal node is simultaneoulsy
attached to one small edge with many mutations and another large edge with few
mutations. The variational_gamma
method incorporates a form of “damping” which
encourages gradual convergence to reasonable node times, but it may still be the case
that the tree sequence topology contains, for example, long deep branches with very few mutations, such as samples attaching directly to a local root.
Such “bad” tree sequences (caused by pathological combinations of topologies and mutations) can result in tsdate raising an error when dating. Issues of this nature are collected in this set of GitHub issues. They can often be fixed by removing bad regions of the tree sequence, e.g. regions that have no variation because they are unmappable or have been removed by a QC filter, such as at the centromere. Preprocessing the tree sequence can remove such regions, as well as cutting ties between nodes by removing locally unary regions and splitting disjoint nodes, which can also cause stability problems.
If numerical issues still persist, this is likely to be a sign that the
tree sequence topology has been poorly inferred, and you are encouraged
to examine it in detail before proceeding. Running the
tsdate.maximization()
method should always work, but may not give
accurate results.
It is also possible for tsdate to have issues when
rescaling, e.g. if there is not enough
information within a rescaling interval. Setting the rescaling_intervals
parameter to a smaller value, or omitting rescaling entirely, should
allow tsdate to run to completion.