Parallelization


When performing large calculations it’s often useful to split the work over multiple processes or threads. The tskit API can be used without issues across multiple processes, and the Python multiprocessing module often provides a very effective way to work with many replicate simulations in parallel.
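As a minimal sketch of the process-based pattern, the following uses only the standard library. The function `simulate_replicate` is a hypothetical stand-in for running one replicate simulation and summarising it; it is not part of tskit. The point is the shape of the code: each worker receives a seed and returns a small summary value, so very little data has to be pickled between processes.

```python
import random
import statistics
from multiprocessing import Pool


def simulate_replicate(seed):
    """Hypothetical stand-in for one replicate simulation plus a summary.

    In real use this would run a simulation with the given seed and
    return a small summary (for example a single statistic) rather than
    the whole simulated dataset.
    """
    rng = random.Random(seed)
    draws = [rng.expovariate(1.0) for _ in range(10_000)]
    return statistics.fmean(draws)


def run_replicates(seeds, processes=2):
    # Each seed is handled independently in a separate worker process.
    with Pool(processes=processes) as pool:
        return pool.map(simulate_replicate, seeds)


if __name__ == "__main__":
    # The guard is needed on platforms that start workers by re-importing
    # this module (e.g. the "spawn" start method).
    results = run_replicates(range(1, 9))
    print(f"{len(results)} replicates, mean = {statistics.fmean(results):.3f}")
```

Seeding each replicate explicitly keeps the results reproducible regardless of which worker process happens to run which replicate.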

When we wish to work with a single very large dataset, however, threads can offer better resource usage because of the shared memory space. The Python threading library gives a very simple interface to lightweight CPU threads and allows us to perform several CPU-intensive tasks in parallel. The tskit API is designed to allow multiple threads to work in parallel when CPU-intensive tasks are being undertaken.

Note

In the CPython implementation, the Global Interpreter Lock (GIL) ensures that only one thread executes Python bytecode at a time. This means that Python code does not parallelize well across threads, but it also avoids a large number of nasty pitfalls associated with multiple threads updating data structures in parallel. Native C extensions like numpy and tskit release the GIL while expensive tasks are being performed, which allows these calculations to proceed in parallel.
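The thread-based pattern can be sketched with the standard library alone. This example is an illustration under stated assumptions, not tskit code: `hashlib` stands in for a C extension that releases the GIL (it does so when hashing buffers larger than 2047 bytes), and `digest_chunk` and `chunk_bounds` are hypothetical helper names. We use `concurrent.futures.ThreadPoolExecutor`, which is built on top of the threading library. All threads read from the same in-memory buffer, so the large dataset is never copied.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

# One large read-only buffer shared by every thread. Slicing a memoryview
# references the same memory rather than copying it.
data = bytes(range(256)) * 40_000  # roughly 10 MB
view = memoryview(data)


def digest_chunk(bounds):
    """Stand-in for a CPU-heavy call into a C extension.

    hashlib releases the GIL while hashing large buffers, so several of
    these calls can make progress at the same time.
    """
    start, stop = bounds
    return hashlib.sha256(view[start:stop]).hexdigest()


def chunk_bounds(length, num_chunks):
    # Split [0, length) into at most num_chunks contiguous intervals.
    step = -(-length // num_chunks)  # ceiling division
    return [(i, min(i + step, length)) for i in range(0, length, step)]


bounds = chunk_bounds(len(data), 4)
with ThreadPoolExecutor(max_workers=4) as executor:
    threaded = list(executor.map(digest_chunk, bounds))

# The threaded results match a plain serial loop over the same chunks.
serial = [digest_chunk(b) for b in bounds]
assert threaded == serial
```

If the worker were written in pure Python instead, the threads would take turns holding the GIL and the threaded version would be unlikely to run any faster than the serial loop.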

Todo

This tutorial previously used code with an old interface, and has therefore been removed. We must recreate an example of parallel processing, giving examples of both threads and processes (but see this Stack Overflow post for why it may be difficult to get multiprocessing working in this notebook). A reasonable example might be to calculate many pairwise statistics between sample sets in parallel.

We should also show that, for large tree sequences, it is better to pass the filename to each subprocess and load the tree sequence there, rather than transferring the entire tree sequence (via pickle) to the subprocesses.
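The filename-passing idea can be sketched as follows, again using only the standard library. Here a JSON file of numbers stands in for a saved tree sequence, and `summarise_file` and `write_example_file` are hypothetical helpers; with tskit, the worker would presumably call `tskit.load(path)` and compute a statistic on the result. Only the short path strings and the small return values cross the process boundary.

```python
import json
import tempfile
from multiprocessing import Pool
from pathlib import Path


def summarise_file(path):
    """Load a large object from disk inside the worker; return a small result.

    Stand-in for loading a tree sequence from `path` in the subprocess
    and computing a summary statistic on it.
    """
    with open(path) as f:
        values = json.load(f)
    return sum(values) / len(values)


def write_example_file(directory, name, values):
    # Hypothetical helper: stands in for saving a large dataset to disk.
    path = Path(directory) / name
    with open(path, "w") as f:
        json.dump(values, f)
    return str(path)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmpdir:
        paths = [
            write_example_file(tmpdir, f"data_{i}.json", list(range(i, i + 100_000)))
            for i in range(4)
        ]
        # Only the path strings are pickled and sent to the workers; the
        # large objects are read from disk inside each process.
        with Pool(processes=2) as pool:
            means = pool.map(summarise_file, paths)
        print(means)
```

The alternative, passing the loaded objects themselves to `pool.map`, would serialise each one in the parent and deserialise it in the child, which costs both time and memory for large inputs.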