Hide code cell source

from IPython.display import display

Data export#

Variant Call Format#

Tskit supports exporting data to the standard Variant Call Format via the tskit vcf command line interface command and the TreeSequence.write_vcf() method in the Python API. Conversion is quite efficient, with tskit producing VCF data at several hundred megabytes per second (for large files), which is usually as fast as it can be written to storage or consumed by programs in a pipeline.

Tip

If we have a tree sequence file the command line interface is often the most convenient way to convert to VCF:

$ tskit vcf example.trees > example.vcf

See the Compressed output section for information on how to compress the VCF output.

For tree sequences produced by recent versions of programs such as msprime, SLiM, fwdpy11 or tsinfer, VCF output will “do the right thing” and no further arguments are needed. For example, here we simulate 3 diploid individuals with mutations using msprime, and convert to VCF.

import sys
import msprime
ts = msprime.sim_ancestry(
    samples=3, ploidy=2, sequence_length=10, random_seed=2)
ts = msprime.sim_mutations(ts, rate=0.1, random_seed=2)
ts.write_vcf(sys.stdout)
##fileformat=VCFv4.2
##source=tskit 1.0.0
##FILTER=<ID=PASS,Description="All filters passed">
##contig=<ID=1,length=10>
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	tsk_0	tsk_1	tsk_2
1	2	0	A	T,G	.	PASS	.	GT	0|0	0|0	2|2
1	3	1	C	T	.	PASS	.	GT	0|0	0|1	0|0
1	4	2	G	C	.	PASS	.	GT	1|0	1|1	0|0
1	6	3	C	T	.	PASS	.	GT	1|0	1|1	0|0
1	9	4	A	C	.	PASS	.	GT	0|0	1|0	0|0

In the output VCF we have 3 diploid samples (see the Terminology section) corresponding to samples specified in the ancestry simulation with IDs tsk_0, tsk_1 and tsk_2 (see the Individual names section for how to override these default labels). We then have a line for every row in the site table, and the data is derived directly from the TreeSequence.variants() method; e.g.

for var in ts.variants():
    print(var.site.position, var.site.id, var.alleles, var.genotypes, sep="\t")
2.0	0	('A', 'T', 'G')	[0 0 0 0 2 2]
3.0	1	('C', 'T')	[0 0 0 1 0 0]
4.0	2	('G', 'C')	[1 0 1 1 0 0]
6.0	3	('C', 'T')	[1 0 1 1 0 0]
9.0	4	('A', 'C')	[0 0 1 0 0 0]

We can see the POS value is equal to the site’s position (see the Modifying coordinates for information on how we deal with continuous coordinates), the ID value is the site’s ID, and the REF and ALT values are derived from the variant’s alleles.

The GT values for the three diploid individuals are derived from the variant’s genotypes (see the Terminology section). For this simulation, the diploid individuals correspond to adjacent sample nodes in order, and we can see there is a direct correspondence between the phased GT values and variant’s genotypes. See the Constructing GT values section for more information on how this done in general and for options to control the VCF sample and GT values.

Important

In these examples we write the VCF data to sys.stdout so that we can see the output. Usually, however, you’d write to a file:

with open("output.vcf", "w") as vcf_file:
    ts.write_vcf(vcf_file)

See also

See the Compressed output section for information on how to compress the output or convert to BCF.

Terminology#

There are some mismatches between the terminology for tskit and VCF. In VCF a “sample” is a multiploid individual, but in tskit a sample refers to a single node (monoploid genome), and an individual consists of one or more nodes (e.g., two nodes for a diploid). Similarly, in VCF a “genotype” refers to the observed allelic state for a sample individual at a particular site, whereas in tskit a genotype is the observed allelic state for a node (see Variant.genotypes).

See also

See the Glossary for more details on tskit’s data model and terminology.

Compressed output#

The simplest way to compress the VCF output is to use the tskit vcf command line interface and pipe the output to bgzip:

$ tskit vcf example.trees | bgzip -c > example.vcf.gz

A general way to convert VCF data to various formats is to pipe the text produced by tskit into bcftools using the command line interface:

$ tskit vcf example.trees | bcftools view -O b > example.bcf

If you need more control over the form of the output (or want to work directly in Python), the following recipe has the same effect:

import os
import subprocess

read_fd, write_fd = os.pipe()
write_pipe = os.fdopen(write_fd, "w")
with open("output.bcf", "w") as bcf_file:
    proc = subprocess.Popen(
        ["bcftools", "view", "-O", "b"], stdin=read_fd, stdout=bcf_file
    )
    ts.write_vcf(write_pipe)
    write_pipe.close()
    os.close(read_fd)
    proc.wait()
    if proc.returncode != 0:
        raise RuntimeError("bcftools failed with status:", proc.returncode)

The VCF output can also be compressed using the gzip Python module:

import gzip

with gzip.open("output.vcf.gz", "wt") as f:
    ts.write_vcf(f)

However, this gzipped VCF won’t be fully compatible with downstream tools such as tabix, which usually require the VCF to use the specialised bgzip format.

Masking output#

The TreeSequence.write_vcf() method provides the site_mask and sample_mask arguments to omit or mark parts of the output as missing.

ts = msprime.sim_ancestry(
    samples=3, ploidy=2, sequence_length=10, random_seed=2)
ts = msprime.sim_mutations(ts, rate=0.1, random_seed=2)
ts.tables.sites
idpositionancestral_statemetadata
02A
13C
24G
36C
49A

The sample_mask argument provides a general way to mask out parts of the output, which can be helpful when simulating missing data. In this (contrived) example, we create a sample mask function that marks one genotype missing in each variant in a regular pattern:

def sample_mask(variant):
    sample_mask = np.zeros(ts.num_samples, dtype=bool)
    sample_mask[variant.site.id % ts.num_samples] = 1
    return sample_mask


ts.write_vcf(sys.stdout, sample_mask=sample_mask)

Constructing GT values#

The core elements of the tskit data model are nodes, edges, sites and mutations. These four tables allow us to completely describe the genetic ancestry of a set of sampled monoploid genomes and their genetic variation. The individual table defines a set of individual organisms, and it can be used to define the inheritance relationships between then (the pedigree). An individual may be associated with one or more nodes, and these nodes may or may not be samples (see the Glossary for clarification of these terms). Thus, there is some complexity in how the per-individual GT values are generated, which we explain in this section.

Without individuals#

We start with an example in which there are no individuals defined (which was the default in msprime before version 1.0):

import tskit
tables = tskit.Tree.generate_balanced(4, span=10).tree_sequence.dump_tables()
tables.sites.add_row(3, ancestral_state="A")
tables.mutations.add_row(site=0, node=0, derived_state="T")
ts = tables.tree_sequence()
display(ts.draw_svg())
display(ts)
ts.write_vcf(sys.stdout)
_images/23c63a0b5f86220de1f6b8199f5db399035f64cc6400411c78ce9f79bda8fafe.svg
Tree Sequence
Trees1
Sequence Length10
Time Unitsunknown
Sample Nodes4
Total Size1.1 KiB
MetadataNo Metadata
Table Rows Size Has Metadata
Edges 6 200 Bytes
Individuals 0 24 Bytes
Migrations 0 8 Bytes
Mutations 1 53 Bytes
Nodes 7 204 Bytes
Populations 0 8 Bytes
Provenances 1 509 Bytes
Sites 1 41 Bytes
Provenance Timestamp Software Name Version Command Full record
28 November, 2025 at 11:33:04 AM tskit 1.0.0 generate_balanced
Details
dict schema_version: 1.0.0
software:
dict name: tskit
version: 1.0.0

parameters:
dict command: generate_balanced
TODO: add parameters

environment:
dict
os:
dict system: Linux
node: runnervmg1sw1
release: 6.11.0-1018-azure
version: #18~24.04.1-Ubuntu SMP Sat Jun
28 04:46:03 UTC 2025
machine: x86_64

python:
dict implementation: CPython
version: 3.11.13

libraries:
dict
kastore:
dict version: 2.1.1



To cite this software, please consult the citation manual: https://tskit.dev/citation/
##fileformat=VCFv4.2
##source=tskit 1.0.0
##FILTER=<ID=PASS,Description="All filters passed">
##contig=<ID=1,length=10>
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	tsk_0	tsk_1	tsk_2	tsk_3
1	3	0	A	T	.	PASS	.	GT	1	0	0	0

Here we define a tree sequence consisting of a single tree, which has a variant site at position 3 and a mutation over node 0. There is no information about individuals in this tree sequence, and so we assume that each of the nodes corresponds to a single haploid individual.

Users of msprime simulations would often be interested in producing VCFs for diploid organisms. Because of the assumptions made by these simulations, this means arbitrarily combining the sample nodes into pairs. This is what the ploidy option does:

ts.write_vcf(sys.stdout, ploidy=2)
##fileformat=VCFv4.2
##source=tskit 1.0.0
##FILTER=<ID=PASS,Description="All filters passed">
##contig=<ID=1,length=10>
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	tsk_0	tsk_1
1	3	0	A	T	.	PASS	.	GT	1|0	0|0

Thus, the GT values for the (synthetic) diploid individual tsk_0 are generated by combining nodes 0 and 1, and tsk_1 by combining nodes 2 and 3.

Important

Software packages modelling multiploid individuals are encouraged to use the individual table to make their assumptions explicit. Recent versions of simulators and inference methods should all do this, and so the ploidy argument is really only intended to support legacy code. It is therefore an error to supply a value for ploidy when individual information is present in a tree sequence.

With individuals#

Extending the example in the previous section, we add some individual data defining a pair of diploid sibs and their parents.

Note

We set the nodes for (e.g.) individual 2 to [1, 3] here to illustrate that nodes for a given individual are not necessarily contiguous.

tables.individuals.add_row(parents=[-1, -1])
tables.individuals.add_row(parents=[-1, -1])
tables.individuals.add_row(parents=[0, 1])
tables.individuals.add_row(parents=[0, 1])
node_individual = tables.nodes.individual
node_individual[[1, 3]] = 2
node_individual[[0, 2]] = 3
tables.nodes.individual = node_individual
display(tables.individuals)
display(tables.nodes)
ts = tables.tree_sequence()
ts.write_vcf(sys.stdout)
idflagslocationparentsmetadata
00-1, -1
10-1, -1
200, 1
300, 1
idtimeflagspopulationindividualmetadata
001-13
101-12
201-13
301-12
410-1-1
510-1-1
620-1-1
##fileformat=VCFv4.2
##source=tskit 1.0.0
##FILTER=<ID=PASS,Description="All filters passed">
##contig=<ID=1,length=10>
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	tsk_2	tsk_3
1	3	0	A	T	.	PASS	.	GT	0|0	1|0

In this model we have four individuals defined, but only individuals 2 and 3 are associated with nodes (more specifically, sample nodes). Thus, we output two VCF sample individuals composed of the linked nodes.

Note

Note that the labels are tsk_0 and tsk_1 even though the individual IDs are 2 and 3. See the Individual names section for how to change the these default labels.

If some individuals have no associated nodes, they are omitted from the VCF output. By default, only nodes that are marked as samples contribute to the VCF genotypes; to include non-sample nodes as well (e.g., internal nodes that have been marked as individuals), set include_non_sample_nodes=True when calling :meth:TreeSequence.write_vcf.

Note

At present, :meth:TreeSequence.write_vcf only supports sites with up to 9 distinct alleles; attempting to write a site with more than 9 alleles will result in a :class:ValueError.

Individual names#

By default the VCF samples are given the labels tsk_0, tsk_1, …, tsk_{N - 1}, where N is the number of individuals to be output (see the Constructing GT values section).

We can change this default labelling using the individual_names argument::

import sys
import msprime
ts = msprime.sim_ancestry(
    samples=3, ploidy=2, sequence_length=10, random_seed=2)
ts = msprime.sim_mutations(ts, rate=0.1, random_seed=2)
ts.write_vcf(sys.stdout, individual_names=["A", "B", "C"])
##fileformat=VCFv4.2
##source=tskit 1.0.0
##FILTER=<ID=PASS,Description="All filters passed">
##contig=<ID=1,length=10>
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	A	B	C
1	2	0	A	T,G	.	PASS	.	GT	0|0	0|0	2|2
1	3	1	C	T	.	PASS	.	GT	0|0	0|1	0|0
1	4	2	G	C	.	PASS	.	GT	1|0	1|1	0|0
1	6	3	C	T	.	PASS	.	GT	1|0	1|1	0|0
1	9	4	A	C	.	PASS	.	GT	0|0	1|0	0|0

Modifying coordinates#

Tree sequence site positions can be floating point values, whereas VCF requires positive integers. The position_transform argument controls how tskit maps coordinates into VCF. Translating non-integer positions necessarily loses precision; by default we round to the nearest integer, so multiple sites may share the same output position. Furthermore, tskit’s coordinate system starts at zero, whereas the VCF standard requires positions to be positive, and so a site at position 0 is not valid in the VCF standard. Because VCF parsers differ, we do not do anything to account for this.

The simplest resolution of this discrepancy in convention between tskit and VCF positions is deal with any site at position 0 as a special case (for instance, by discarding or ignoring it). A different interpretation of this difference between tskit’s position and VCF’s POS field is that they are different coordinate systems: tskit coordinates are “distance to the right of the left end of the chromosome”, while VCF coordinates are “which number site, counting from the left end of the chromosome and starting at one”. Under this interpretation, the solution is to supply an explicit position_transform that adds 1 to the coordinate when outputting to VCF (or, using the "legacy" option described below). However, this can easily lead to off-by-one errors converting between the coordinate systems, so should only be used if you really are using 0-based coordinates for your tree sequence.

Warning

Most VCF tools cannot deal with a POS value of 0. If your tree sequence contains a site with position 0, this will likely cause an error.

Internally, the coordinates used in the VCF output are obtained by applying the position_transform function to the array of site positions (and, for the contig length, to the tree sequence :attr:.TreeSequence.sequence_length). This function must return a one-dimensional array of the same length as its input; otherwise a :class:ValueError is raised. In addition to accepting a callable, tskit also supports the string value "legacy" here, which selects the pre-0.2.0 behaviour used by the original VCF exporter: positions are rounded to the nearest integer, starting at 1, and are forced to be strictly increasing by incrementing ties.

The VCF specification does not allow positions to be 0. By default, if any transformed position is 0, :meth:TreeSequence.write_vcf will raise a :class:ValueError. If you wish to retain these records you can either:

  • set allow_position_zero=True to write such sites anyway;

  • mask the offending sites using the site_mask argument; or

  • choose a position_transform that maps 0 to a valid positive position.

For example, to shift all coordinates by 1, we could define:

def one_based_positions(positions):
    return [int(round(x)) + 1 for x in positions]

ts.write_vcf(sys.stdout, position_transform=one_based_positions)
##fileformat=VCFv4.2
##source=tskit 1.0.0
##FILTER=<ID=PASS,Description="All filters passed">
##contig=<ID=1,length=11>
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	tsk_0	tsk_1	tsk_2
1	3	0	A	T,G	.	PASS	.	GT	0|0	0|0	2|2
1	4	1	C	T	.	PASS	.	GT	0|0	0|1	0|0
1	5	2	G	C	.	PASS	.	GT	1|0	1|1	0|0
1	7	3	C	T	.	PASS	.	GT	1|0	1|1	0|0
1	10	4	A	C	.	PASS	.	GT	0|0	1|0	0|0

Note

The msprime 0.x legacy API simulates using continuous coordinates. It may be simpler to update your code to use the msprime 1.0 API (which uses discrete coordinates by default) than to work out how to transform coordinates in a way that is suitable for your application.