Tree sequences and simulation

Yan Wong, Georgia Tsambos, and Peter Ralph

Simulations are important in population genetics for many reasons:

Exploration

Simulations allow us to explore the influence of various historical scenarios on observed patterns of genetic variation and inheritance.

Benchmarking and evaluating methodologies

To assess the accuracy of inferential methods, we need test datasets for which the true values of important parameters are known.

Model training

Some methods for ancestry inference are trained on simulated data (e.g. Approximate Bayesian Computation). This is especially important in studies of complex demographies, where the many potential parameters and models make it impractical to specify likelihood functions.

There are two major forms of population genetic simulation: forwards-time and backwards-time. In general, forwards-time simulation is more detailed and realistic, while backwards-time simulation is faster and more efficient.

More specifically, with a few exceptions, backwards-time simulators focus on neutral models, while forwards-time simulators are better suited to complex scenarios, including those involving selection and continuous space.

Advantages of tree sequences

Some forwards-time (SLiM, fwdpy) and backwards-time (msprime) simulators have a built-in capacity to output tree sequences. This can have several benefits:

  1. Neutral mutations, which often account for the majority of genetic variation, do not need to be tracked during the simulation, but can be added afterwards. See “Do you really need mutations?”.

  2. Tree sequences can be used as an interchange format to combine backwards and forwards simulations, allowing you to benefit from the strengths of both approaches. This is detailed in Completing forwards simulations.

Some tips on simulation

Even with fast modern software, simulating full genome sequences of entire populations can take some time. If your simulations are too slow, it is worth benchmarking them on a range of shorter chromosomes or smaller sample sizes, then extrapolating to estimate how long the simulations you actually want to run would take.
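
The benchmark-then-extrapolate idea can be sketched with the standard library alone. The helper names (`time_runs`, `fit_power_law`, `predict`) and the `toy_workload` stand-in are hypothetical, and the power-law form t = a * size^b is an assumption made for illustration. It is not the curve suggested by the msprime paper, so check the fit against your own timings before trusting an extrapolation.

```python
import math
import time


def time_runs(run, sizes):
    """Call run(size) once per size and return (size, seconds) pairs."""
    timings = []
    for size in sizes:
        start = time.perf_counter()
        run(size)
        timings.append((size, time.perf_counter() - start))
    return timings


def fit_power_law(timings):
    """Least-squares fit of log(t) = log(a) + b * log(size); returns (a, b)."""
    xs = [math.log(size) for size, _ in timings]
    ys = [math.log(seconds) for _, seconds in timings]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = math.exp(mean_y - b * mean_x)
    return a, b


def predict(a, b, size):
    """Predicted run time in seconds at the given size under the fitted model."""
    return a * size ** b


def toy_workload(size):
    # Hypothetical stand-in for a simulation call. Replace it with a function
    # that runs your own simulation at the given sequence length or sample size.
    sum(i * i for i in range(size))


timings = time_runs(toy_workload, [10_000, 20_000, 40_000, 80_000])
a, b = fit_power_law(timings)
print(f"fitted exponent: {b:.2f}")
print(f"predicted seconds at size 10,000,000: {predict(a, b, 10_000_000):.2f}")
```

Single timings are noisy, so in practice it helps to repeat each size several times and to include sizes as close as is affordable to the target, since extrapolating far beyond the measured range magnifies any error in the fitted exponent.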

Todo

Add an example with a matplotlib fitted curve for some msprime simulations with e.g. a high recombination rate.

Timing simulations that take from a few minutes to a few hours, and consulting the msprime paper for suggestions on which curve to fit to those timings, should give you good predictions. See issue #104.