Working with Metadata

Metadata is information associated with entities that tskit doesn’t use or interpret, but which is useful to pass on to downstream analysis, such as sample IDs or dates (see Metadata for a full discussion). Each table has a MetadataSchema which details the contents and encoding of the metadata for each row. A metadata schema is a JSON document that conforms to JSON Schema (the full schema for tskit is at Full metaschema). Here we use an example tree sequence which contains some demonstration metadata:

import tskit
import json

ts = tskit.load("data/metadata.trees")

Reading metadata and schemas

Metadata is automatically decoded using the schema when accessed via a TreeSequence or TableCollection Python API. For example:

print("Metadata for individual 0:", ts.individual(0).metadata)  # Tree sequence access
print("Metadata for individual 0:", ts.tables.individuals[0].metadata)  # Table access
Metadata for individual 0: {'accession': 'ERS0001', 'pcr': True}
Metadata for individual 0: {'accession': 'ERS0001', 'pcr': True}

Viewing the MetadataSchema for a table can help with understanding its metadata, as it can contain descriptions and constraints:

ts.table_metadata_schemas.individual
{"additionalProperties":false,"codec":"json","properties":{"accession":{"description":"ENA accession number","type":"string"},"pcr":{"description":"Was PCR used on this sample","name":"PCR Used","type":"boolean"}},"required":["accession","pcr"],"type":"object"}

The same schema can be accessed via the metadata_schema attribute on each table (pretty-printed here using json.dumps):

schema = ts.tables.individuals.metadata_schema
print(json.dumps(schema.asdict(), indent=4))  # Print with indentations
{
    "additionalProperties": false,
    "codec": "json",
    "properties": {
        "accession": {
            "description": "ENA accession number",
            "type": "string"
        },
        "pcr": {
            "description": "Was PCR used on this sample",
            "name": "PCR Used",
            "type": "boolean"
        }
    },
    "required": [
        "accession",
        "pcr"
    ],
    "type": "object"
}

The top-level metadata and schemas for the entire tree sequence are similarly accessed with TreeSequence.metadata and TreeSequence.metadata_schema.
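
Because a schema is plain JSON, its descriptions and constraints can also be summarised programmatically. The following is a minimal sketch that works on the JSON text of the individual schema shown above, using only the standard library; the helper name summarise_schema is hypothetical and not part of tskit.

```python
import json

# JSON text of the individual metadata schema printed above.
schema_json = (
    '{"additionalProperties":false,"codec":"json","properties":'
    '{"accession":{"description":"ENA accession number","type":"string"},'
    '"pcr":{"description":"Was PCR used on this sample","name":"PCR Used",'
    '"type":"boolean"}},"required":["accession","pcr"],"type":"object"}'
)


def summarise_schema(schema):
    """Return one (field, type, required, description) tuple per property."""
    required = set(schema.get("required", []))
    return [
        (name, spec.get("type"), name in required, spec.get("description", ""))
        for name, spec in schema.get("properties", {}).items()
    ]


summary = summarise_schema(json.loads(schema_json))
for name, type_, is_required, description in summary:
    flag = "required" if is_required else "optional"
    print(f"{name} ({type_}, {flag}): {description}")
```

With a loaded tree sequence, the dict returned by schema.asdict() in the example above could be passed in place of json.loads(schema_json).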

Note

If there is no schema (i.e. it is equal to MetadataSchema(None)) for a table or top-level metadata, then no decoding is performed and bytes will be returned.

Modifying metadata and schemas

If you are creating or modifying a tree sequence by changing the underlying tables, you may want to record or add to the metadata. If the change fits into the same schema, this is relatively simple: you can follow the description of minor table edits in the Tables and editing tutorial. However, if it requires a change to the schema, the schema must be changed first, as it is then used to validate and encode the metadata.

Schemas in tskit are held in a MetadataSchema. A Python dict representation of the schema is passed to its constructor, which will validate the schema. Here are two examples: the first allows arbitrary fields to be added; the second (which constructs the schema we printed above) does not:

basic_schema = tskit.MetadataSchema({'codec': 'json'})

complex_schema = tskit.MetadataSchema({
    'codec': 'json',
    'additionalProperties': False,
    'properties': {'accession': {'description': 'ENA accession number',
                                 'type': 'string'},
                   'pcr': {'description': 'Was PCR used on this sample',
                           'name': 'PCR Used',
                           'type': 'boolean'}},
    'required': ['accession', 'pcr'],
    'type': 'object',
})

This MetadataSchema can then be assigned to a table or to the top-level tree sequence via the metadata_schema attribute, e.g.:

tables = tskit.TableCollection(sequence_length=1)  # make a new, empty set of tables
tables.individuals.metadata_schema = complex_schema

This will overwrite any existing schema. Note that this will not validate any existing metadata against the new schema. Now that the table has a schema, calls to add_row() will validate and encode the metadata:

row_id = tables.individuals.add_row(0, metadata={"accession": "Bob1234", "pcr": True})
print(f"Row {row_id} added to the individuals table")
Row 0 added to the individuals table

If we try to add metadata that doesn’t fit the schema, such as accidentally using a string instead of a proper Python boolean, we’ll get an error:

tables.individuals.add_row(0, metadata={"accession": "Bob1234", "pcr": "false"})
---------------------------------------------------------------------------
ValidationError                           Traceback (most recent call last)
File /opt/hostedtoolcache/Python/3.8.12/x64/lib/python3.8/site-packages/tskit/metadata.py:679, in MetadataSchema.validate_and_encode_row(self, row)
    678 try:
--> 679     self._validate_row(row)
    680 except jsonschema.exceptions.ValidationError as ve:

File /opt/hostedtoolcache/Python/3.8.12/x64/lib/python3.8/site-packages/jsonschema/validators.py:353, in create.<locals>.Validator.validate(self, *args, **kwargs)
    352 for error in self.iter_errors(*args, **kwargs):
--> 353     raise error

ValidationError: 'false' is not of type 'boolean'

Failed validating 'type' in schema['properties']['pcr']:
    OrderedDict([('description', 'Was PCR used on this sample'),
                 ('name', 'PCR Used'),
                 ('type', 'boolean')])

On instance['pcr']:
    'false'

The above exception was the direct cause of the following exception:

MetadataValidationError                   Traceback (most recent call last)
Input In [9], in <cell line: 1>()
----> 1 tables.individuals.add_row(0, metadata={"accession": "Bob1234", "pcr": "false"})

File /opt/hostedtoolcache/Python/3.8.12/x64/lib/python3.8/site-packages/tskit/tables.py:885, in IndividualTable.add_row(self, flags, location, parents, metadata)
    883 if metadata is None:
    884     metadata = self.metadata_schema.empty_value
--> 885 metadata = self.metadata_schema.validate_and_encode_row(metadata)
    886 return self.ll_table.add_row(
    887     flags=flags, location=location, parents=parents, metadata=metadata
    888 )

File /opt/hostedtoolcache/Python/3.8.12/x64/lib/python3.8/site-packages/tskit/metadata.py:681, in MetadataSchema.validate_and_encode_row(self, row)
    679         self._validate_row(row)
    680     except jsonschema.exceptions.ValidationError as ve:
--> 681         raise exceptions.MetadataValidationError(str(ve)) from ve
    682 return self.encode_row(row)

MetadataValidationError: 'false' is not of type 'boolean'

Failed validating 'type' in schema['properties']['pcr']:
    OrderedDict([('description', 'Was PCR used on this sample'),
                 ('name', 'PCR Used'),
                 ('type', 'boolean')])

On instance['pcr']:
    'false'

Similarly, because we set additionalProperties to False in the schema, an error is also raised if we attempt to add new fields:

tables.individuals.add_row(0, metadata={"accession": "Bob1234", "pcr": True, "newKey": 25})
---------------------------------------------------------------------------
ValidationError                           Traceback (most recent call last)
File /opt/hostedtoolcache/Python/3.8.12/x64/lib/python3.8/site-packages/tskit/metadata.py:679, in MetadataSchema.validate_and_encode_row(self, row)
    678 try:
--> 679     self._validate_row(row)
    680 except jsonschema.exceptions.ValidationError as ve:

File /opt/hostedtoolcache/Python/3.8.12/x64/lib/python3.8/site-packages/jsonschema/validators.py:353, in create.<locals>.Validator.validate(self, *args, **kwargs)
    352 for error in self.iter_errors(*args, **kwargs):
--> 353     raise error

ValidationError: Additional properties are not allowed ('newKey' was unexpected)

Failed validating 'additionalProperties' in schema:
    OrderedDict([('additionalProperties', False),
                 ('codec', 'json'),
                 ('properties',
                  OrderedDict([('accession',
                                OrderedDict([('description',
                                              'ENA accession number'),
                                             ('type', 'string')])),
                               ('pcr',
                                OrderedDict([('description',
                                              'Was PCR used on this '
                                              'sample'),
                                             ('name', 'PCR Used'),
                                             ('type', 'boolean')]))])),
                 ('required', ['accession', 'pcr']),
                 ('type', 'object')])

On instance:
    {'accession': 'Bob1234', 'newKey': 25, 'pcr': True}

The above exception was the direct cause of the following exception:

MetadataValidationError                   Traceback (most recent call last)
Input In [10], in <cell line: 1>()
----> 1 tables.individuals.add_row(0, metadata={"accession": "Bob1234", "pcr": True, "newKey": 25})

File /opt/hostedtoolcache/Python/3.8.12/x64/lib/python3.8/site-packages/tskit/tables.py:885, in IndividualTable.add_row(self, flags, location, parents, metadata)
    883 if metadata is None:
    884     metadata = self.metadata_schema.empty_value
--> 885 metadata = self.metadata_schema.validate_and_encode_row(metadata)
    886 return self.ll_table.add_row(
    887     flags=flags, location=location, parents=parents, metadata=metadata
    888 )

File /opt/hostedtoolcache/Python/3.8.12/x64/lib/python3.8/site-packages/tskit/metadata.py:681, in MetadataSchema.validate_and_encode_row(self, row)
    679         self._validate_row(row)
    680     except jsonschema.exceptions.ValidationError as ve:
--> 681         raise exceptions.MetadataValidationError(str(ve)) from ve
    682 return self.encode_row(row)

MetadataValidationError: Additional properties are not allowed ('newKey' was unexpected)

Failed validating 'additionalProperties' in schema:
    OrderedDict([('additionalProperties', False),
                 ('codec', 'json'),
                 ('properties',
                  OrderedDict([('accession',
                                OrderedDict([('description',
                                              'ENA accession number'),
                                             ('type', 'string')])),
                               ('pcr',
                                OrderedDict([('description',
                                              'Was PCR used on this '
                                              'sample'),
                                             ('name', 'PCR Used'),
                                             ('type', 'boolean')]))])),
                 ('required', ['accession', 'pcr']),
                 ('type', 'object')])

On instance:
    {'accession': 'Bob1234', 'newKey': 25, 'pcr': True}

To set the top-level metadata, just assign it. Validation and encoding happen as specified by the top-level metadata schema:

tables.metadata_schema = basic_schema  # Allows new fields to be added that are not validated
tables.metadata = {"mean_coverage": 200.5}
print(tables.metadata)
{'mean_coverage': 200.5}

Note

Provenance information, detailing the origin of the data, modification timestamps, and (ideally) how the tree sequence can be reconstructed, should go in Provenance, not metadata.

To modify a schema (for example, to add a key), first get the dict representation, modify it, then write it back:

schema_dict = tables.individuals.metadata_schema.schema
schema_dict["properties"]["newKey"] = {"type": "integer"}
tables.individuals.metadata_schema = tskit.MetadataSchema(schema_dict)
# Now this will work:
new_id = tables.individuals.add_row(metadata={'accession': 'abc123', 'pcr': False, 'newKey': 25})
print(tables.individuals[new_id].metadata)
{'accession': 'abc123', 'newKey': 25, 'pcr': False}

To modify the metadata of existing rows in tables, use the methods described in Metadata for bulk table methods.

Viewing raw metadata

If you need to see the raw (i.e. bytes) metadata, you just need to remove the schema, for instance:

individual_table = tables.individuals.copy()  # don't change the original tables.individuals

print("Metadata:\n", individual_table[0].metadata)

individual_table.metadata_schema = tskit.MetadataSchema(None)
print("\nRaw metadata:\n", individual_table[0].metadata)
Metadata:
 {'accession': 'Bob1234', 'pcr': True}

Raw metadata:
 b'{"accession":"Bob1234","pcr":true}'
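
The raw bytes shown above are compact JSON with sorted keys. As a rough sketch of what encoding and decoding with the json codec amounts to, the following uses only the standard library. It is an approximation for illustration, not tskit's actual implementation, and both function names are hypothetical.

```python
import json


def encode_json_metadata(row):
    """Serialise a dict to compact, key-sorted JSON bytes."""
    return json.dumps(row, sort_keys=True, separators=(",", ":")).encode("utf-8")


def decode_json_metadata(raw):
    """Parse JSON bytes back into a Python dict."""
    return json.loads(raw.decode("utf-8"))


raw = encode_json_metadata({"pcr": True, "accession": "Bob1234"})
print(raw)                        # b'{"accession":"Bob1234","pcr":true}'
print(decode_json_metadata(raw))  # {'accession': 'Bob1234', 'pcr': True}
```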

Metadata for bulk table methods

In the interests of efficiency, each table’s packset_metadata() method, as well as the more general set_columns() and append_columns() methods, does not attempt to validate or encode metadata. You can call MetadataSchema.validate_and_encode_row() directly to prepare metadata for these methods:

metadata_column = [
    {"accession": "etho1234", "pcr": True},
    {"accession": "richard1235", "pcr": False},
    {"accession": "albert1236", "pcr": True},
]
encoded_metadata_column = [
    tables.individuals.metadata_schema.validate_and_encode_row(r) for r in metadata_column
]
md, md_offset = tskit.pack_bytes(encoded_metadata_column)
tables.individuals.set_columns(flags=[0, 0, 0], metadata=md, metadata_offset=md_offset)
tables.individuals
id  flags  location  parents  metadata
0   0                         {'accession': 'etho1234', 'pcr': True}
1   0                         {'accession': 'richard1235', 'pcr': F...
2   0                         {'accession': 'albert1236', 'pcr': True}

Or, if not all columns need to be set:

tables.individuals.packset_metadata(
    [tables.individuals.metadata_schema.validate_and_encode_row(r) for r in metadata_column]
)
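
The packed form used by these bulk methods stores the bytes of all rows end to end in one flat column, with an offset column marking where each row starts and ends. The following standard-library sketch illustrates that layout; pack_rows and unpack_rows are hypothetical helpers, and tskit.pack_bytes itself returns NumPy arrays rather than the plain Python objects used here.

```python
def pack_rows(rows):
    """Concatenate byte strings and record the boundary offsets between them."""
    packed = bytearray()
    offsets = [0]
    for row in rows:
        packed += row
        offsets.append(len(packed))  # end of this row, start of the next
    return bytes(packed), offsets


def unpack_rows(packed, offsets):
    """Recover the original byte strings from the flat data and its offsets."""
    return [packed[start:end] for start, end in zip(offsets, offsets[1:])]


rows = [b'{"pcr":true}', b"", b'{"pcr":false}']
packed, offsets = pack_rows(rows)
print(offsets)                       # one more entry than there are rows
print(unpack_rows(packed, offsets))  # round-trips to the original rows
```

Note that an empty row simply repeats the previous offset, so rows without metadata cost nothing in the flat column.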

Binary metadata

To disable the validation and encoding of metadata and store raw bytes, pass None to MetadataSchema:

tables.populations.metadata_schema = tskit.MetadataSchema(None)
tables.populations.add_row(metadata=b"SOME CUSTOM BYTES #!@")
print(tables.populations[0].metadata)
b'SOME CUSTOM BYTES #!@'
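
With no schema in place, choosing a byte layout and decoding it again is left to the caller. The following is a minimal sketch using Python's standard struct module; the record layout (a little-endian 32-bit integer followed by a 32-bit float) and the field names are purely hypothetical.

```python
import struct

# Hypothetical fixed-width layout: int32 sample count, float32 mean depth.
RECORD = struct.Struct("<if")


def encode_record(sample_count, mean_depth):
    """Pack the two fields into 8 raw bytes."""
    return RECORD.pack(sample_count, mean_depth)


def decode_record(raw):
    """Unpack 8 raw bytes back into a dict of named fields."""
    sample_count, mean_depth = RECORD.unpack(raw)
    return {"sample_count": sample_count, "mean_depth": mean_depth}


raw_record = encode_record(7, 1.5)
print(len(raw_record))            # 8
print(decode_record(raw_record))  # {'sample_count': 7, 'mean_depth': 1.5}
```

Bytes produced this way could be passed as the metadata argument to add_row() on a table whose schema is MetadataSchema(None), as in the example above.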