Development#

If you would like to add some features to msprime, please read the following. If you think there is anything missing, please open an issue or pull request on GitHub!

Quickstart#

  • Make a fork of the msprime repo on GitHub

  • Clone your fork into a local directory, making sure that the submodules are correctly initialised:

    $ git clone git@github.com:tskit-dev/msprime.git --recurse-submodules
    

    For an already checked out repo, the submodules can be initialised using:

    $ git submodule update --init --recursive
    
  • Install the requirements.

  • Build the low level module by running make in the project root.

  • Run the tests to ensure everything has worked: python3 -m pytest. These should all pass.

  • Install the pre-commit checks: pre-commit install

  • Make your changes in a local branch. On each commit a pre-commit hook will run checks for code style and common problems. Sometimes these will report “files were modified by this hook” git add and git commit --amend will update the commit with the automatically modified version. The modifications made are for consistency, code readability and designed to minimise merge conflicts. They are guaranteed not to modify the functionality of the code. To run the checks without committing use pre-commit run. To bypass the checks (to save or get feedback on work-in-progress) use git commit --no-verify

  • If you have modifed the C code then clang-format -i lib/tests/* lib/!(avl).{c,h} will format the code to satisfy CI checks.

  • When ready open a pull request on GitHub. Please make sure that the tests pass before you open the PR, unless you want to ask the community for help with a failing test.

  • See the tskit documentation for more details on the recommended GitHub workflow.

Requirements#

System requirements#

To develop with msprime you will need to have GSL installed and a working compiler. Please see the pip install from source section for how to install GSL on some common platforms.

Important

You still need to install GSL and have a working compiler if you are working on the documentation because it requires a locally built version of the C module.

Python requirements#

The list of packages needed for development are listed in requirements/development.txt. Install these using either:

conda install --file requirements/development.txt

or

$ python -m pip install -r requirements/development.txt

depending on whether you are a conda or pip user.

Overview#

There are three main parts of msprime, in increasing order of complexity:

  1. High-level Python. The Python-API and command line interface tools are all defined in the msprime directory.

  2. C library. The underlying high-performance C code is written as a standalone library. All of the code for this library is in the lib directory.

  3. Low-level Python-C interface. The interface between the Python and C code is the msprime/_msprimemodule.c file, which defines the msprime._msprime module.

Each of these aspects has its own coding conventions and development tools, which are documented in the following sections.

Continuous integration tests#

Two different continuous integration providers are used, which run different combinations of tests on different platforms:

  1. Github actions run a variety of code style and quality checks using pre-commit along with Python tests on Linux, OSX and Windows. The docs are also built and a preview generated if changes are detected.

  2. CircleCI Runs all Python tests using the apt-get infrastructure for system requirements. Additionally, the low-level tests are run, coverage statistics calculated using CodeCov, and the documentation built.

Documentation#

The msprime manual exists to provide users with a comprehensive and authoritative source of information on msprime’s interfaces. From a high-level, the documentation is split into two main sections:

  1. The API Reference documentation provides a concise and precise description of a particular function or class. See the API Reference section for details.

  2. Thematically structured sections which discuss the functionality and explain features via minimal examples. See the Examples section for details.

Further documentation where features are combined to perform specific tasks is provided in the tskit tutorials site.

Documentation can be written on GitHub or locally. If you are new to contributing to msprime and you will be making minor edits to markdown text, you may find it easier to make edits on GitHub. To do this, hover your mouse over the GitHub icon in the top right corner of the documentation, and click “suggest edit”. You can then edit and preview markdown in GitHub’s user interface. Clicking “propose change” at the bottom of the page will commit to a new branch on your fork. If you do not already have a fork of the msprime repository, GitHub will prompt you to “fork this repository” - go ahead and create your fork. You can then create a pull request for your proposed change. On the other hand, if you are already familiar with contributing to msprime, or have more than simple markdown edits to add, you can edit and build the documentation locally. To do that, first follow the Quickstart. Once you have created and checked out a “topic branch”, you are ready to start editing the documentation.

Note

Please make sure you have built the low-level C module by running make in the project root directory before going any further. A lot of inscrutable errors are caused by a mismatch between the low-level C module installed in your system (or an older development version you previously compiled) and the local development version of msprime.

Building#

To build the documentation locally, go to the docs directory and run make (ensure that the Requirements have been installed and the low-level C module has been built—see the note in the previous section). This will build the HTML documentation in docs/_build/html/. You can now view the local build of the HTML in your local browser (if you do not know how to do this, try double clicking the HTML file).

Note

If you are having some problems with getting the documentation to build try running make clean which will delete all of the HTML and cached Jupyter notebook content.

JupyterBook#

Documentation for msprime is built using Jupyter Book, which allows us to mix API documentation generated automatically using Sphinx with code examples evaluated in a local Jupyter kernel. This is a very powerful system that allows us to generate beautiful and useful documentation, but it is quite new and has some quirks and gotchas. In particular, because of the mixture of API documentation and notebook content we need to write documentation using two different markup languages.

reStructuredText#

All of the documentation for previous versions of msprime was written using the reStructuredText format (rST) which is the default for Python documentation. Because of this, all of the API docstrings (see the API Reference section) are written using rST. Converting these docstrings to Markdown would be a lot of work (and support from upstream tools for Markdown dosctrings is patchy), and so we need to use rST for this purpose for the forseeable future.

Some of the directives we use are only available in rST, and so these must be enclosed in eval-rst blocks like so:

```{eval-rst}
.. autoclass:: msprime.StandardCoalescent
```

Markdown#

Everything besides API docstrings is written using MyST Markdown. This is a superset of common Markdown which enables executable Jupyter content to be included in the documentation. In particular, JupyterBook and MyST are built on top of Sphinx which allows us to do lots of cross-referencing.

Some useful links:

  • The MyST cheat sheet is a great resource.

  • The “Write Book Content” part of the Jupyter Book documentation has lots of helpful examples and links.

  • The MyST Syntax Guide is a good reference for the full syntax

  • Sphinx directives. Some of these will work with Jupyter Book, some won’t. There’s currently no comprehensive list of those that do. However, we tend to only use a small subset of the available directives, and you can usually get by following existing local examples.

  • The types of source files section in the Jupyter Book documentation is useful reference for mixing and matching Markdown and rST like we’re doing.

API Reference#

API reference documentation is provided by docstrings in the source code. These docstrings are written using reStructuredText and Sphinx.

Docstrings should be concise and precise. Examples should not be provided directly in the docstrings, but each significant parameter (e.g.) in a function should have links to the corresponding examples section.

Todo

Provide an example of a well-documented docstring with links to an examples section. We should use one of the simpler functions as an example of this.

Examples#

The API reference documentation is gives precise formal information about how to use a particular function of class. The rest of the manual should provide the discussion and examples needed to contextualise this information and help users to orient themselves. The examples section for a given feature (e.g., function parameter) should:

  • Provide some background into what this feature is for, so that an unsure reader can quickly orient themselves (external links to explain concepts is good for this).

  • Give examples using inline Jupyter code to illustrate the various different ways that this feature can be used. These examples should be as small as possible, so that the overall document runs quickly.

Juptyer notebook code is incluced by using blocks like this:

```{code-cell}
print("This is python code!)
a = list(range(10)
a
```

These cells behave exactly like they would in a Jupyter notebook (the whole document is actually treated and executed like one notebook)

Warning

For a document to be evaluated as a notebook you must have exactly the right YAML Frontmatter at the top of the file.

Cross referencing#

Cross referencing is done by using the {ref} inline role (see Jupyter Book documentation for more details) to link to labelled sections within the manual or to API documentation.

Sections within the manual should be labelled hierarchically, for example this section is labelled like this:

(sec_development_documentation_cross_referencing)=
### Cross referencing

Elsewhere in the Markdown documentation we can then refer to this section like:

See the {ref}`sec_development_documentation_cross_referencing` section for details.

Cross references like this will automatically use the section name as the link text, which we can override if we like:

See {ref}`another section<sec_development_documentation_cross_referencing>` for more information.

To refer to a given section from an rST docstring, we’d do something like

See the :ref:`sec_development_documentation_cross_referencing` section for more details.

When we want to refer to the API documentation for a function or class, we use the appropriate inline text role to do so. For example,

The {func}`.sim_ancestry` function lets us simulate ancestral histories.

It’s a good idea to always use this form when referring to functions or classes so that the reader always has direct access to the API documentation for a given function when they might it.

High-level Python#

Throughout this document, we assume that the msprime package is built and run locally within the project directory. That is, msprime is not installed into the Python installation using pip install -e or setuptools development mode. Please ensure that you build the low-level module using (e.g.) make and that the shared object file is in the msprime directory. This will have a name like _msprime.cpython-38-x86_64-linux-gnu.so, depending on your platform and Python version.

Conventions#

All Python code follows the PEP8 style guide, and is checked using the flake8 tool as part of the continuous integration tests. Black is used as part of the pre-commit hook for python code style and formatting.

Packaging#

msprime is packaged and distributed as Python module, and follows the current best-practices advocated by the Python Packaging Authority. The primary means of distribution is though PyPI, which provides the canonical source for each release.

A package for conda is also available on conda-forge.

Tests#

The tests for the high-level code are in the tests directory, and run using pytest. A lot of the simulation and basic tests are contained in the tests/test_highlevel.py file, but more recently smaller test files with more focussed tests are preferred (e.g., test_vcf.py, test_demography.py).

All new code must have high test coverage, which will be checked as part of the continuous integration tests by CodeCov.

Interfacing with low-level module#

Much of the high-level Python code only exists to provide a simpler interface to the low-level _msprime module. As such, many objects (such as RecombinationMap) are really just a shallow layer on top of the corresponding low-level object. The convention here is to keep a reference to the low-level object via a private instance variable such as self._ll_recombination_map.

Command line interfaces#

The command line interfaces for msprime are defined in the msprime/cli.py file. Each CLI has a single entry point (e.g. msp_main) which is invoked to run the program. These entry points are registered with setuptools using the console_scripts argument in setup.py, which allows them to be deployed as first-class executable programs in a cross-platform manner.

There are simple scripts in the root of the project (currently: msp_dev.py, mspms_dev.py) which are used for development. For example, to run the development version of mspms use python3 mspms_dev.py.

C Library#

The low-level code for msprime is written in C, and is structured as a standalone library. This code is all contained in the lib directory. Although the code is structured as a library, it is not intended to be used outside of the msprime project! The interfaces at the C level change considerably over time, and are deliberately undocumented.

Toolchain#

To compile and develop the C code, a few extra development libraries are needed. Libconfig is used for the development CLI and CUnit for unit tests. We use the meson build system in conjunction with ninja-build to compile the unit tests and development CLI. On Debian/Ubuntu, these can be installed using

$ sudo apt-get install libcunit1-dev libconfig-dev ninja-build

Meson is best installed via pip:

$ python3 -m pip install meson --user

On macOS rather than use apt-get for installation of these requirements a combination of homebrew and pip can be used (working as of 2020-01-15).

$ brew install cunit
$ python3 -m pip install meson --user
$ python3 -m pip install ninja --user

On macOS, conda builds are generally done using clang packages that are kept up to date:

$ conda install clang_osx-64  clangxx_osx-64

In order to make sure that these compilers work correctly (e.g., so that they can find other dependencies installed via conda), you need to compile msprime with this command on versions of macOS older than “Mojave”:

$ CONDA_BUILD_SYSROOT=/ python3 setup.py build_ext -i

On more recent macOS releases, you may omit the CONDA_BUILD_SYSROOT prefix.

Note

The use of the C toolchain on macOS is a moving target. The above advice was written on 23 January 2020 and was validated by a few msprime contributors. Caveat emptor, etc..

Compiling#

Meson keeps all compiled binaries in a build directory (this has many advantages such as allowing multiple builds with different options to coexist). It depends on a meson.build file which is in the lib directory. To set up the initial build directory, run

$ cd lib
$ meson build

The easiest way to compile the Unit Tests is to run ninja -C build. (Alternatively, you can cd into the build directory and run ninja). All the compiled binaries are then in the build directory, so to run, for example, the test_ancestry unit tests, use ./build/test_ancestry. A handy shortcut to compile the code and run all the unit tests is:

$ ninja -C build test

The mesonic plugin for vim simplifies this process and allows code to be compiled seamlessly within the editor.

Development CLI#

When developing the C code, it is usually best to use the development CLI to invoke the code. This is much simpler than going through the Python interface, and allows tools such as valgrind to be used directly. For example, when developing new simulation functionality, you should get the basic work done using the CLI and only move over to the Python API once you are reasonably sure that the code works properly.

The development CLI is written using libconfig to parse the simulation parameters file, and argtable3 to parse the command line arguments. The argtable3 code is included in the source (but not used in the distributed binaries, since this is strictly a development tool). The source code is in dev-tools/dev-cli.c.

After building, the CLI is run as follows:

$ ./build/dev-cli <command> <arguments>

Running the dev-cli program without arguments will print out a summary of the options.

The most important command for simulator development is simulate, which takes a configuration file as a parameter and writes the resulting simulation to an output file in the native .trees format. For example,

$ ./build/dev-cli simulate dev-tools/example.cfg -o out.trees

The development configuration file describes the simulation that we want to run, and uses the libconfig syntax. An example is given in the file dev-tools/example.cfg which should have sufficient documentation to be self-explanatory.

Unit Tests#

The C-library has an extensive suite of unit tests written using CUnit. These tests aim to establish that the low-level APIs work correctly over a variety of inputs, and particularly, that the tests don’t result in leaked memory or illegal memory accesses. The tests should be periodically run under valgrind to make sure of this.

Tests are defined in the tests directory, roughly split into suites defined in different files. For example, the tests associated with Fenwick trees are defined in the tests/tests_fenwick.c file. To run all the tests in this suite, use run using ./build/test_fenwick. To run a specific test in a particular suite, provide the name of the test name as a command line argument, e.g.:

$ ./build/test_fenwick test_fenwick_expand

While 100% test coverage is not feasible for C code, we aim to cover all code that can be reached. (Some classes of error such as malloc failures and IO errors are difficult to simulate in C.) Code coverage statistics are automatically tracked using CodeCov.

Code Style#

C code is formatted using clang-format with a custom configuration. To ensure that your code is correctly formatted, you can run

make clang-format

in the project root before submitting a pull request. Alternatively, you can run clang-format -i *.[c,h] in the lib directory.

Vim users may find the vim-clang-format plugin useful for automatically formatting code.

Coding conventions#

The code is written using the C99 standard. All variable declarations should be done at the start of a function, and functions kept short and simple where at all possible.

No global or module level variables are used for production code.

The code is organised following object-oriented principles. Each ‘class’ is defined using a struct, which encapsulates all the data it requires. Every ‘method’ on this class is then a function that takes this struct as its first parameter. Each class has an alloc method, which is responsible for allocating memory and a free method which frees all memory used by the object. For example, the Fenwick tree class is defined as follows:

typedef struct {
    size_t size;
    size_t log_size;
    double *tree;
    double *values;
} fenwick_t;

int fenwick_alloc(fenwick_t *self, size_t initial_size);
int fenwick_free(fenwick_t *self);
double fenwick_get_total(fenwick_t *self);

This defines the fenwick_t struct, and alloc and free methods and a method to return the total of the tree. Note that we follow the Python convention and use self to refer to the current instance.

Most objects also provide a print_state method, which is useful for debugging.

Todo

Change to intersphinx mapping for this link.

Please see the documentation for the tskit C API for more details on the how APIs are structured.

Error handling#

A critical element of producing reliable C programs is consistent error handling and checking of return values. All return values must be checked! In msprime, all functions (except the most trivial accessors) return an integer to indicate success or failure. Any negative value is an error, and must be handled accordingly. The following pattern is canonical:

    ret = msp_do_something(self, argument);
    if (ret != 0) {
        goto out;
    }
    // rest of function
out:
    return ret;

Here we test the return value of msp_do_something and if it is non-zero, abort the function and return this same value from the current function. This is a bit like throwing an exception in higher-level languages, but discipline is required to ensure that the error codes are propagated back to the original caller correctly.

Particular care must be taken in functions that allocate memory, because we must ensure that this memory is freed in all possible success and failure scenarios. The following pattern is used throughout for this purpose:

    double *x = NULL;

    x = malloc(n * sizeof(double));
    if (x == NULL) {
        ret = MSP_ERR_NO_MEMORY;
        goto out;
    }
    // rest of function
out:
    if (x != NULL) {
        free(x);
    }
    return ret;

It is vital here that x is initialised to NULL so that we are guaranteed correct behaviour in all cases. For this reason, the convention is to declare all pointer variables on a single line and to initialise them to NULL as part of the declaration.

Error codes are defined in err.h, and these can be translated into a message using msp_strerror(err).

Running valgrind#

Valgrind is an essential development tool, and is used extensively. (Being able to run valgrind was one of the motivating factors in the C-library architecture. It is difficult to run valgrind on a Python extension module, and so the simplest way to ensure that the low-level code is memory-tight is to separate it out into an independent library.)

Any new C unit tests that are written should be verified using valgrind to ensure that no memory is leaked. The entire test suite should be run through valgrind periodically also to detect any leaks or illegal memory accesses that have been overlooked.

Python C Interface#

The Python C interface is written using the Python C API and the code is in the msprime/_msprimemodule.c file. When compiled, this produces the msprime._msprime module, which is imported by the high-level module. The low-level Python module is not intended to be used directly and may change arbitrarily over time.

The conventions used within the low-level module here closely follow those in tskit; please see the documentation for more information.

Statistical tests#

To ensure that msprime is simulating the correct process we run many statistical tests. Since these tests are quite expensive (taking some hours to run) and difficult to automatically validate, they are not run as part of CI but instead as a pre-release sanity check. They are also very useful to run when developing new simulation functionality, as subtle statistical bugs can easily slip in unnoticed.

The statistical tests are all run via the verification.py script in the project root. The script has some extra dependencies listed in the requirements/verification.txt, which can be installed using pip install -r or conda install --file. Run this script using:

$ python3 verification.py

The statistical tests depend on compiled programs in the data directory. This includes a customised version of ms and a locally compiled version of scrm. These programs must be compiled before running the statistical tests, and can be built by running make in the data directory. If this is successful, there should be several binaries like ms and ms_summary_stats present in the data directory.

Please read the comments at the top of the verification.py script for details on how to write and run these tests.

Benchmarking#

Benchmarks to measure performance are in the benchmarks folder and are run using airspeed velocity. An automated system runs the benchmarks on each push to the main branch and uploads the results to this github pages site. These benchmarks can also be run locally to compare your branch with the main branch. Your changes must be in a commit to be measured. To run the benchmarks:

asv run asv run HEAD...main~1

This will run the benchmarks for the latest main branch commit and all commits on your current branch (the syntax for choosing commits is the same as git log). The following commands then make a browsable report (link given in output of the command):

asv publish
asv preview

Note the following tips:

  • Specifying the range of commits to run uses the same syntax as git log. For example, to run for a single commit, use asv run 88fbbc33^!

  • Be careful when running asv dev or using python=same as this can use the installed version of msprime rather than the local development version. This can lead to confusing results! When tuning benchmarks it’s better to commit often and use (e.g.) asv run HEAD^! --show-stderr -b Hudson.time_large_sample_size.

Containerization#

To run msprime in a container, see the installation instructions.

You can use docker to locally build an image, but it requires root access:

$ sudo docker build -t tskit/msprime .

podman can build and run images without root privilege.

$ podman build -t tskit/msprime .

Troubleshooting#

  • If make is giving you strange errors, or if tests are failing for strange reasons, try running make clean in the project root and then rebuilding.

  • Beware of multiple versions of the python library installed by different programs (e.g., pip versus installing locally from source)! In python, msprime.__file__ will tell you the location of the package that is being used.