Parallelization
When performing large calculations it’s often useful to split the
work over multiple processes or threads. The tskit API can be used
without issues across multiple processes, and the Python multiprocessing
module often provides a very effective way to work with many replicate
simulations in parallel.
When we wish to work with a single very large dataset, however, threads can
offer better resource usage because of the shared memory space. The Python
threading library gives a very simple interface to lightweight CPU
threads and allows us to perform several CPU-intensive tasks in parallel. The
tskit API is designed to allow multiple threads to work in parallel when
CPU-intensive tasks are being undertaken.
Note
In the CPython implementation, the Global Interpreter Lock (GIL) ensures that
only one thread executes Python bytecode at a time. This means that
Python code does not parallelise well across threads, but it avoids a large
number of nasty pitfalls associated with multiple threads updating
data structures in parallel. Native C extensions like numpy and tskit
release the GIL while expensive tasks are being performed, allowing
these calculations to proceed in parallel.
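The effect described in this note can be seen with a minimal sketch that uses only the standard library. Here hashlib stands in for numpy or tskit: that substitution is an assumption made for illustration, chosen because hashlib is a C extension that releases the GIL when hashing more than a couple of kilobytes of data. The helper names (digest_chunk, chunk_bounds) are hypothetical. All threads read the same in-memory buffer, so nothing is copied between them.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

# One large in-memory buffer shared by every thread (about 10 MB).
data = bytes(range(256)) * 40_000


def digest_chunk(bounds):
    """Hash one slice of the shared buffer.

    hashlib (a stand-in for numpy/tskit here) releases the GIL while it
    hashes large inputs, so several of these calls can overlap.
    """
    start, stop = bounds
    # Slicing a memoryview avoids copying the underlying bytes.
    return hashlib.sha256(memoryview(data)[start:stop]).hexdigest()


def chunk_bounds(total, n_chunks):
    """Split range(total) into contiguous (start, stop) pairs."""
    step = -(-total // n_chunks)  # ceiling division
    return [(i, min(i + step, total)) for i in range(0, total, step)]


bounds = chunk_bounds(len(data), 4)

# Threaded version: each worker thread hashes its own slice.
with ThreadPoolExecutor(max_workers=4) as pool:
    threaded = list(pool.map(digest_chunk, bounds))

# Serial version, for comparison of the results.
serial = [digest_chunk(b) for b in bounds]
print("threaded and serial results agree:", threaded == serial)
```

Any speed-up depends on the machine and on how much of the work happens inside the C extension with the GIL released; pure Python work inside the worker function would still run one thread at a time.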
Todo
This tutorial previously used code with an old interface, and hence has been removed.
We must recreate an example of parallel processing, giving examples of both
threads and processes (but see this Stack Overflow post for why it may be
difficult to get multiprocessing working in this notebook).
A reasonable example might be to calculate many pairwise statistics between sample sets
in parallel.
We should also show that, for large tree sequences, it is better to pass the filename to each subprocess and load the tree sequence there, rather than transferring the entire tree sequence (via pickle) to the subprocesses.
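Until a full tskit example is written, the filename-passing pattern can be sketched with the standard library alone. This is not the tutorial's eventual example: the JSON file, the load_dataset loader and the toy statistic are all hypothetical stand-ins. With tskit, the loader would be tskit.load(path) and the statistic might be something like TreeSequence.divergence between two sample sets. The point is only that each worker process receives a short path string and loads the data once for itself, instead of having the whole dataset pickled and sent to it.

```python
import itertools
import json
import os
import tempfile
from concurrent.futures import ProcessPoolExecutor

_dataset = None  # per-process global, filled in by init_worker


def load_dataset(path):
    """Hypothetical stand-in loader; with tskit this would be tskit.load(path)."""
    with open(path) as f:
        return json.load(f)


def init_worker(path):
    """Runs once in each worker process: only the path string is transferred."""
    global _dataset
    _dataset = load_dataset(path)


def pair_statistic(pair):
    """Toy pairwise statistic: absolute difference between the two set means."""
    a, b = pair
    mean_a = sum(_dataset[a]) / len(_dataset[a])
    mean_b = sum(_dataset[b]) / len(_dataset[b])
    return a, b, abs(mean_a - mean_b)


def all_pairs_parallel(path, names, workers=2):
    """Compute the statistic for every pair of named sets in a process pool."""
    pairs = list(itertools.combinations(names, 2))
    with ProcessPoolExecutor(
        max_workers=workers, initializer=init_worker, initargs=(path,)
    ) as pool:
        return list(pool.map(pair_statistic, pairs))


if __name__ == "__main__":
    # Toy data written to disk so that workers can load it by filename.
    toy = {"A": [1, 2, 3], "B": [4, 5, 6], "C": [10, 20, 30]}
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "toy.json")
        with open(path, "w") as f:
            json.dump(toy, f)
        for a, b, value in all_pairs_parallel(path, sorted(toy)):
            print(a, b, value)
```

Loading in a pool initializer means each worker reads the file once, however many pairs it goes on to process. The main guard matters when worker processes are started by re-importing the script, and, as the note above says, process pools may not behave well inside a notebook.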