Tsinfer 0.5.0 changes

New in tsinfer 0.5.0: Greatly improved computational performance

Tsinfer 0.5.0 introduces an internal change to ancestor generation, the first step of the tsinfer algorithm, that greatly improves overall computational performance. We noticed that the ancestors we assume to be the oldest, those corresponding to high-frequency focal sites, were being generated unrealistically long. The algorithm now makes better use of the genotype data during ancestor building, so ancestor lengths no longer increase with relative age. We can see this in the simulation results below, for a 10 Mbp region with 7000 samples:

[Figure: ancestor length against relative age, simulated 10 Mbp region with 7000 samples]

The changes in 0.5.0 greatly improve the computational scaling of the most demanding steps of tsinfer, matching ancestors and matching samples. As the simulation benchmark below of a 10 Mbp region shows, the improvement grows with sample size.

[Figure: computational scaling benchmark by sample size, simulated 10 Mbp region]

We have validated the changes in 0.5.0 extensively and found no significant differences in inference accuracy. For example, the results below, from simulations of a 1 Mbp region, show a weighted average of Kendall-Colijn and Robinson-Foulds distances, along with the recently published ARF dissimilarity score available in tscompare. For each sample size, we ran 10 simulations and calculated every metric on each one. No metric showed a significant difference between the versions.

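[Figure: Kendall-Colijn, Robinson-Foulds, and ARF accuracy metrics by sample size, simulated 1 Mbp region]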
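For readers unfamiliar with the Robinson-Foulds distance used in the validation above, the toy sketch below may help. It is not the code used for these benchmarks, and it assumes the rooted, clade-based variant of the metric: each tree is reduced to the set of leaf groups under its internal nodes, and the distance is the number of groups found in one tree but not the other. The nested-tuple tree representation and the names `clades` and `rf_distance` are illustrative choices of ours, not part of tsinfer or tscompare.

```python
def clades(tree):
    """Return the set of clades (leaf sets under each internal node, root included).

    A tree is a nested tuple of leaf names, e.g. (("A", "B"), ("C", "D")).
    This representation is an assumption made for illustration only.
    """
    found = set()

    def visit(node):
        if isinstance(node, str):
            # A leaf contributes its own name but is not recorded as a clade.
            return frozenset([node])
        leaves = frozenset().union(*(visit(child) for child in node))
        found.add(leaves)
        return leaves

    visit(tree)
    return found


def rf_distance(tree_a, tree_b):
    """Rooted Robinson-Foulds distance: clades present in exactly one tree."""
    return len(clades(tree_a) ^ clades(tree_b))


balanced = (("A", "B"), ("C", "D"))
reshuffled = (("A", "C"), ("B", "D"))
caterpillar = ((("A", "B"), "C"), "D")

print(rf_distance(balanced, balanced))     # identical topologies
print(rf_distance(balanced, reshuffled))   # no internal clade shared except the root
print(rf_distance(balanced, caterpillar))  # one clade differs on each side
```

A distance of zero means the two topologies are identical. Larger values mean more groupings are unique to one tree, which is why an unchanged distance across versions is evidence that accuracy was not affected.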

A detailed description and validation of the internal changes in 0.5.0 will be published in the future. We strongly recommend upgrading to this version if you are using tsinfer on large datasets; we have observed speedups of over 8x in some cases.
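If you are unsure whether an environment already has the new version, a small check along the following lines may be useful. It is a minimal sketch using only the Python standard library. The helper names (`parse_version`, `has_improved_ancestor_generation`, `installed_tsinfer_version`) are hypothetical, and the parsing is deliberately simple: it reads only the leading numeric components, so a pre-release such as `0.5.0b1` is treated as `0.5.0`.

```python
from importlib import metadata


def parse_version(text):
    """Parse leading numeric components, e.g. '0.5.0' -> (0, 5, 0).

    Simplified on purpose: suffixes such as 'b1' or '.dev3' are ignored,
    and short versions are padded with zeros to three components.
    """
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def has_improved_ancestor_generation(version_text):
    """True if the version is 0.5.0 or later (hypothetical helper)."""
    return parse_version(version_text) >= (0, 5, 0)


def installed_tsinfer_version():
    """Return the installed tsinfer version string, or None if it is absent."""
    try:
        return metadata.version("tsinfer")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_tsinfer_version()
    if version is None:
        print("tsinfer is not installed")
    elif has_improved_ancestor_generation(version):
        print(f"tsinfer {version}: includes the 0.5.0 changes")
    else:
        print(f"tsinfer {version}: consider upgrading to 0.5.0 or later")
```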