Metadata processing with Python

JSON metadata

If your metadata are generated in JSON format via serde (see here), then the metadata are simple to access from Python. The code repository for tskit-rust contains examples in the python/ subdirectory.
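As a minimal sketch of what this access looks like without a schema: tskit returns row metadata as raw bytes, which the standard `json` module can decode. The byte string below is a hypothetical stand-in for a value such as `ts.mutation(0).metadata`, and its field names simply mirror the mutation metadata used later on this page.

```python
import json

# Hypothetical stand-in for `ts.mutation(0).metadata` when no schema is set:
# JSON written from Rust arrives in Python as UTF-8 encoded bytes.
raw = b'{"effect_size": -0.001, "dominance": 0.1}'

# json.loads accepts bytes directly and returns a plain dict.
md = json.loads(raw)
print(md["effect_size"], md["dominance"])
```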

You may work with JSON metadata with or without a metadata schema (see here). A schema is useful for data validation, but it introduces an inefficiency when your input to Python is a tree sequence rather than a table collection: you must copy the tables, add the metadata schema, and regenerate a tree sequence. See the examples mentioned above.
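The copy, annotate, and regenerate steps described above might look like the following sketch. The helper name `with_metadata_schema` and the file name `example.trees` are hypothetical; `dump_tables`, the `metadata_schema` table attribute, and `tree_sequence` are tskit-python calls. The sketch assumes that the mutation and individual tables share one schema.

```python
def with_metadata_schema(ts, schema):
    """Return a new tree sequence whose mutation and individual tables
    carry `schema`. Hypothetical helper; the input `ts` is not modified."""
    # 1. Copy the tables out of the immutable tree sequence.
    tables = ts.dump_tables()
    # 2. Attach the schema to each table that holds metadata.
    tables.mutations.metadata_schema = schema
    tables.individuals.metadata_schema = schema
    # 3. Regenerate a tree sequence from the annotated tables.
    return tables.tree_sequence()


# Intended usage, assuming tskit is installed and the (hypothetical)
# file example.trees holds JSON metadata:
#   import tskit
#   ts = tskit.load("example.trees")
#   ts = with_metadata_schema(ts, tskit.MetadataSchema({"codec": "json"}))
#   ts.mutation(0).metadata  # now decoded to a dict rather than raw bytes
```

The copy in step 1 is the inefficiency mentioned above: it duplicates every table only to change the schema fields.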

Other formats

The tskit-python API only supports JSON and Python’s struct data formats. A format other than JSON can be useful for minimizing storage requirements. However, using one requires that you provide a method to convert the data into a valid Python object.
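To give a rough sense of the storage difference, the sketch below encodes the same two floating-point values as JSON and with Python's `struct` module. The values and the `"<dd"` layout are illustrative assumptions; this is not the bincode format used later on this page.

```python
import json
import struct

effect_size, dominance = -1e-3, 0.1

# JSON repeats the field names and stores numbers as decimal text in every row.
as_json = json.dumps(
    {"effect_size": effect_size, "dominance": dominance}
).encode("utf-8")

# struct packs the two values as 8-byte little-endian doubles: 16 bytes total.
as_struct = struct.pack("<dd", effect_size, dominance)

print(len(as_json), len(as_struct))

# The cost of the compact form: the reader must know the layout in advance.
decoded = struct.unpack("<dd", as_struct)
```

The binary form carries no field names, which is why a conversion method has to be supplied separately.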

An easy way to provide conversion methods is to use pyo3 to create a small Python module to deserialize your metadata into Python objects. The tskit-rust code repository contains an example of this in the python/ subdirectory. The module is shown in its entirety below:

// NOTES:
// This code shows how to decode metadata generated in rust
// using a format that tskit-python does NOT support.
//
// This example works by creating rust structs that exactly mimic
// our metadata types and are exposed to Python.
// For production code, it would be wiser to reuse the rust types
// that first generated the metadata.
// We cannot do that here, else we'd have to publish the crates
// defining those types to crates.io/PyPI, which is just
// ecosystem pollution.
//
// Importantly, deserialization does not require that our
// input/output types be identical!
// Rather, they simply have to have the same fields.
// We exploit this fact here, which allows us to make
// new types with the same fields as our metadata.

use pyo3::prelude::*;

#[derive(serde::Serialize, serde::Deserialize)]
#[pyclass]
struct MutationMetadata {
    effect_size: f64,
    dominance: f64,
}

#[pymethods]
impl MutationMetadata {
    fn effect_size(&self) -> f64 {
        self.effect_size
    }
    fn dominance(&self) -> f64 {
        self.dominance
    }
}

#[derive(serde::Serialize, serde::Deserialize)]
#[pyclass]
struct IndividualMetadata {
    name: String,
    phenotypes: Vec<i32>,
}

#[pymethods]
impl IndividualMetadata {
    fn name(&self) -> String {
        self.name.clone()
    }
    fn phenotypes(&self) -> Vec<i32> {
        self.phenotypes.clone()
    }
}

/// Decode mutation metadata generated in rust via the `bincode` crate.
#[pyfunction]
fn decode_bincode_mutation_metadata(md: Vec<u8>) -> PyResult<MutationMetadata> {
    bincode::deserialize_from(md.as_slice())
        .map_err(|_| pyo3::exceptions::PyValueError::new_err("error decoding mutation metadata"))
}

/// Decode individual metadata generated in rust via the `bincode` crate.
#[pyfunction]
fn decode_bincode_individual_metadata(md: Vec<u8>) -> PyResult<IndividualMetadata> {
    bincode::deserialize_from(md.as_slice())
        .map_err(|_| pyo3::exceptions::PyValueError::new_err("error decoding individual metadata"))
}

/// A Python module implemented in Rust.
#[pymodule]
fn tskit_glue(m: &Bound<'_, PyModule>) -> PyResult<()> {
    m.add_function(wrap_pyfunction!(decode_bincode_mutation_metadata, m)?)?;
    m.add_function(wrap_pyfunction!(decode_bincode_individual_metadata, m)?)?;
    Ok(())
}

Using it in Python is just a matter of importing the module:

import tskit
import tskit_glue
import numpy as np


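def setup_ts_without_schema():
    # Load the file as written by Rust. No metadata schema is attached,
    # so tskit returns each row's metadata as raw bytes.
    ts = tskit.TreeSequence.load("with_bincode_metadata.trees")
    return ts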


def test_individual_metadata():
    # NOTE: the assertions here rely on knowing
    # what the Rust example that wrote
    # with_bincode_metadata.trees put into the metadata!
    ts = setup_ts_without_schema()
    md = tskit_glue.decode_bincode_individual_metadata(ts.individual(0).metadata)
    assert md.name() == "Jerome"
    assert md.phenotypes() == [0, 1, 2, 0]


def test_mutation_metadata():
    # NOTE: the assertions here rely on knowing
    # what the Rust example that wrote
    # with_bincode_metadata.trees put into the metadata!
    ts = setup_ts_without_schema()
    md = tskit_glue.decode_bincode_mutation_metadata(ts.mutation(0).metadata)
    assert np.isclose(md.effect_size(), -1e-3)
    assert np.isclose(md.dominance(), 0.1)
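
When decoding many rows, it may help to account for rows whose bytes cannot be decoded. The sketch below relies only on what the Rust module above shows: the decoder functions raise `ValueError` on failure. The helper `decode_or_none` is hypothetical, and a `struct`-based stand-in decoder is used so that the example runs without `tskit_glue`.

```python
import struct


def decode_or_none(decoder, raw):
    """Apply `decoder` to raw metadata bytes; return None if decoding fails."""
    try:
        return decoder(raw)
    except ValueError:
        return None


def standin_decoder(raw):
    # Stand-in for tskit_glue.decode_bincode_mutation_metadata (an assumption
    # made so this runs anywhere): expects two little-endian doubles.
    try:
        return struct.unpack("<dd", raw)
    except struct.error as e:
        raise ValueError("error decoding mutation metadata") from e


# The second row is empty, so the stand-in decoder rejects it.
rows = [struct.pack("<dd", -1e-3, 0.1), b""]
decoded = [decode_or_none(standin_decoder, r) for r in rows]
```

With the real module, the same helper could be applied as `decode_or_none(tskit_glue.decode_bincode_mutation_metadata, m.metadata)` for each `m` in `ts.mutations()`.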