Real data#

Real world data is likely to consist of larger datasets than in the Usage examples, and may exhibit issues that are not present in simulated data such as numerical instability and other problems.

In particular, two issues can have a large influence when dating tree sequences inferred by tools such as tsinfer:

  1. An inferred tree sequence may have too many edge constraints, for instance if multiple ancestors in the true genealogy are wrongly combined into a single node in the inferred tree sequence, or if a parent-child relationship (i.e. an edge) is wrongly inferred between two nodes.

  2. Inferred tree sequences may contain polytomies (multifurcations) indicating uncertainty about the order of branching in the original genealogy. In this case, it is possible to choose the polytomy date to represent either the oldest of the contributing branch points, or the expected pairwise tMRCA between the lineages contributing to the polytomy. The first provides a more accurate estimate of both mutation dates and the date of nodes under each mutation. The second provides a better estimate of coalescence times and edge areas / total branch lengths.

Issue 1 can be ameliorated by fully simplifying the tree sequence, and breaking up nodes (e.g. by splitting nodes into separate ancestors: see Preprocessing). Using the default variational_gamma method, issue 2 can be addressed during the rescaling step by explicitly setting match_segregating_sites to either True or False (see Rescaling details below)

Preprocessing#

For real datasets, you are advised to run preprocess_ts() on your inferred tree sequence before dating. This removes regions with no variable sites, simplifies the tree sequence to remove locally unary portions of nodes, and splits disjoint nodes into separate datable nodes. Not only can this improve the accuracy of dating, but it also increases Numerical stability. See the documentation for tsdate.preprocess_ts() for details on how to increase or decrease its stringency.

In particular, the tsinfer algorithm can overestimate the span of genome covered by some nodes, especially those at ancient times. This is particularly seen when nodes contain “locally unary” sections (i.e. where the node only has one child in some trees), or where a node is “disjoint”, disappearing from one tree to reappear in a tree in a more distant region of the genome. This extended node span can cause problems by tying together distant parts of the genealogy. Preprocessing removes locally unary sections of nodes via simplification, and splits disjoint nodes using the tsdate.util.split_disjoint_nodes() functionality.

Note

If unary regions are correctly estimated, they can help improve dating slightly. You can set the allow_unary=True option to run tsdate on such tree sequences.

Continuous time considerations#

In general, when analysing real data with tsdate, we recommended using the default tsdate.variational_gamma() method. This has several parameters which can be adjusted. The max_iterations parameter simply adjusts the number of rounds of Expectation propagation, and is usually sufficiently large to give reasonable results on real data. The other parameters concern rescaling.

Rescaling details#

Tsdate uses a modified version of the rescaling algorithm introduced in SINGER [Deng et al., 2024] that works on gamma natural parameters rather than point estimates. Rescaling can help to deal with the effects of variable population size though time (see Population size). Currently, three parameters can be set:

  • match_segregating_sites. This has the largest effect if tree sequences contain polytomies.

    1. When explicitly set to True, times within intervals are scaled such that the average mutational distance between pairs of samples best matches the mutation rate. This is similar to the approach in SINGER, and is appropriate if, for example, you are interested in estimating coalescence rates over time, or average divergent times (MRCAs) between samples. In the resulting dated tree sequence, polytomy dates will correspond to the average distance between the pairs of lineages contributing to the polytomy.

    2. By default, match_segregating_sites=False, meaning that times within intervals are simultaneously scaled such that the expected density of mutations along each path from a sample to the root best matches the mutational density predicted from the user-provided mutation rate. This is most suitable if you wish to estimate dates of mutations or dates of identifiable common ancestors. In the resulting dated tree sequence, polytomy dates will correspond to the oldest ancestor that contributes to the polytomy.

  • rescaling_intervals. This sets the number of time intervals over which rescaling takes place. To turn off rescaling entirely, a value of 0 can be provided; however, resulting dates may be less accurately estimated if the dataset comes from a set of samples with a complex demographic history (see Population size)

  • rescaling_iterations: several rounds of rescaling are carried out. Normally very few iterations are required to reach a stable set of rescaled intervals.

Todo

Describe the rescaling step in more detail. Could also link to the population size docs

Discrete time considerations#

A few parameters can be set to speed up discrete-time methods.

Note

For Discrete-time methods, tsdate scales quadratically in the number of time slices. To increase speed or temporal resolution, you are thus advised to keep with the Continuous-time variational_gamma method.

For discrete-time methods, before the dating algorithm is run the conditional coalescent prior distribution must be calculated for each node. Although this is roughly linear in the number of nodes, it can still take some time if there are millions of nodes. To speed up this process an approximation to the conditional coalescent is used for nodes which have a large number of descendants (the exact number can be modified by making a prior using the approx_prior_size parameter). The results of this approximation are cached, so that tsdate will run slightly faster the second time it is run.

Regularly reused mutational likelihoods are precalculated in the dicrete-time methods. This precalculation can be parallelised by specifying the num_threads parameter to tsdate.inside_outside() and tsdate.maximization(). However, this behaviour is subject to change in future versions.

Numerical stability#

When passing messages between nodes, it is possible that the node time updates are wildly incompatible with each other, for instance if a focal node is simultaneoulsy attached to one small edge with many mutations and another large edge with few mutations. The variational_gamma method incorporates a form of “damping” which encourages gradual convergence to reasonable node times, but it may still be the case that the tree sequence topology contains, for example, long deep branches with very few mutations, such as samples attaching directly to a local root.

Such “bad” tree sequences (caused by pathological combinations of topologies and mutations) can result in tsdate raising an error when dating. Issues of this nature are collected in this set of GitHub issues. They can often be fixed by removing bad regions of the tree sequence, e.g. regions that have no variation because they are unmappable or have been removed by a QC filter, such as at the centromere. Preprocessing the tree sequence can remove such regions, as well as cutting ties between nodes by removing locally unary regions and splitting disjoint nodes, which can also cause stability problems.

If numerical issues still persist, this is likely to be a sign that the tree sequence topology has been poorly inferred, and you are encouraged to examine it in detail before proceeding. Running the tsdate.maximization() method should always work, but may not give accurate results.

It is also possible for tsdate to have issues when rescaling, e.g. if there is not enough information within a rescaling interval. Setting the rescaling_intervals parameter to a smaller value, or omitting rescaling entirely, should allow tsdate to run to completion.