Data export#
Variant Call Format#
Tskit supports exporting data to the standard
Variant Call Format
via the tskit vcf command line interface command
and the TreeSequence.write_vcf() method in the Python API.
Conversion is quite efficient, with tskit producing VCF data at several
hundred megabytes per second (for large files), which is usually as fast as
it can be written to storage or consumed by programs in a pipeline.
Tip
If we have a tree sequence file the command line interface is often the most convenient way to convert to VCF:
$ tskit vcf example.trees > example.vcf
See the Compressed output section for information on how to compress the VCF output.
For tree sequences produced by recent versions of programs such as
msprime, SLiM, fwdpy11 or tsinfer, VCF output will
“do the right thing” and no further arguments are needed.
For example, here we simulate 3 diploid individuals
with mutations using msprime, and convert to VCF.
import sys
import msprime
ts = msprime.sim_ancestry(
samples=3, ploidy=2, sequence_length=10, random_seed=2)
ts = msprime.sim_mutations(ts, rate=0.1, random_seed=2)
ts.write_vcf(sys.stdout)
##fileformat=VCFv4.2
##source=tskit 1.0.0
##FILTER=<ID=PASS,Description="All filters passed">
##contig=<ID=1,length=10>
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT tsk_0 tsk_1 tsk_2
1 2 0 A T,G . PASS . GT 0|0 0|0 2|2
1 3 1 C T . PASS . GT 0|0 0|1 0|0
1 4 2 G C . PASS . GT 1|0 1|1 0|0
1 6 3 C T . PASS . GT 1|0 1|1 0|0
1 9 4 A C . PASS . GT 0|0 1|0 0|0
In the output VCF we have 3 diploid samples
(see the Terminology section)
corresponding to samples specified in the ancestry simulation
with IDs tsk_0, tsk_1 and tsk_2
(see the Individual names
section for how to override these default labels).
We then have a line for every row
in the site table, and
the data is derived directly from the TreeSequence.variants()
method; e.g.
for var in ts.variants():
print(var.site.position, var.site.id, var.alleles, var.genotypes, sep="\t")
2.0 0 ('A', 'T', 'G') [0 0 0 0 2 2]
3.0 1 ('C', 'T') [0 0 0 1 0 0]
4.0 2 ('G', 'C') [1 0 1 1 0 0]
6.0 3 ('C', 'T') [1 0 1 1 0 0]
9.0 4 ('A', 'C') [0 0 1 0 0 0]
We can see the POS value is equal to the site’s position
(see the Modifying coordinates for information
on how we deal with continuous coordinates), the ID value
is the site’s ID, and the REF and ALT values
are derived from the variant’s alleles.
The GT values for the three diploid individuals are derived from the
variant’s genotypes (see the Terminology section).
For this simulation, the diploid individuals correspond to
adjacent sample nodes in order, and we can see there is a direct
correspondence between the phased GT values and variant’s genotypes.
See the Constructing GT values section for
more information on how this done in general and for options
to control the VCF sample and GT values.
Important
In these examples we write the VCF data to sys.stdout so that we can see
the output. Usually, however, you’d write to a file:
with open("output.vcf", "w") as vcf_file:
ts.write_vcf(vcf_file)
See also
See the Compressed output section for information on how to compress the output or convert to BCF.
Terminology#
There are some mismatches between the terminology for tskit and VCF.
In VCF a “sample” is a multiploid individual, but in tskit a sample
refers to a single node (monoploid genome), and an individual
consists of one or more nodes (e.g., two nodes for a diploid).
Similarly, in VCF a “genotype” refers to the observed allelic state
for a sample individual at a particular site,
whereas in tskit a genotype is the observed allelic state
for a node (see Variant.genotypes).
See also
See the Glossary for more details on tskit’s data model and terminology.
Compressed output#
The simplest way to compress the VCF output is to use the
tskit vcf command line interface
and pipe the output to bgzip:
$ tskit vcf example.trees | bgzip -c > example.vcf.gz
A general way to convert VCF data to various formats is to pipe the text
produced by tskit into bcftools using the command
line interface:
$ tskit vcf example.trees | bcftools view -O b > example.bcf
If you need more control over the form of the output (or want to work directly in Python), the following recipe has the same effect:
import os
import subprocess
read_fd, write_fd = os.pipe()
write_pipe = os.fdopen(write_fd, "w")
with open("output.bcf", "w") as bcf_file:
proc = subprocess.Popen(
["bcftools", "view", "-O", "b"], stdin=read_fd, stdout=bcf_file
)
ts.write_vcf(write_pipe)
write_pipe.close()
os.close(read_fd)
proc.wait()
if proc.returncode != 0:
raise RuntimeError("bcftools failed with status:", proc.returncode)
The VCF output can also be compressed using the gzip Python module:
import gzip
with gzip.open("output.vcf.gz", "wt") as f:
ts.write_vcf(f)
However, this gzipped VCF won’t be fully compatible with downstream tools such as tabix, which usually require the VCF to use the specialised bgzip format.
Masking output#
The TreeSequence.write_vcf() method provides the
site_mask and sample_mask arguments to
omit or mark parts of the output as missing.
ts = msprime.sim_ancestry(
samples=3, ploidy=2, sequence_length=10, random_seed=2)
ts = msprime.sim_mutations(ts, rate=0.1, random_seed=2)
ts.tables.sites
| id | position | ancestral_state | metadata |
|---|---|---|---|
| 0 | 2 | A | |
| 1 | 3 | C | |
| 2 | 4 | G | |
| 3 | 6 | C | |
| 4 | 9 | A |
The sample_mask argument provides a general way to mask out
parts of the output, which can be helpful when simulating missing
data. In this (contrived) example, we create a sample mask function
that marks one genotype missing in each variant in a regular
pattern:
def sample_mask(variant):
sample_mask = np.zeros(ts.num_samples, dtype=bool)
sample_mask[variant.site.id % ts.num_samples] = 1
return sample_mask
ts.write_vcf(sys.stdout, sample_mask=sample_mask)
Constructing GT values#
The core elements of the tskit data model are nodes, edges, sites and mutations. These four tables allow us to completely describe the genetic ancestry of a set of sampled monoploid genomes and their genetic variation. The individual table defines a set of individual organisms, and it can be used to define the inheritance relationships between then (the pedigree). An individual may be associated with one or more nodes, and these nodes may or may not be samples (see the Glossary for clarification of these terms). Thus, there is some complexity in how the per-individual GT values are generated, which we explain in this section.
Without individuals#
We start with an example in which there are no individuals defined (which was the default in msprime before version 1.0):
import tskit
tables = tskit.Tree.generate_balanced(4, span=10).tree_sequence.dump_tables()
tables.sites.add_row(3, ancestral_state="A")
tables.mutations.add_row(site=0, node=0, derived_state="T")
ts = tables.tree_sequence()
display(ts.draw_svg())
display(ts)
ts.write_vcf(sys.stdout)
|
|
|
|---|---|
| Trees | 1 |
| Sequence Length | 10 |
| Time Units | unknown |
| Sample Nodes | 4 |
| Total Size | 1.1 KiB |
| Metadata | No Metadata |
| Table | Rows | Size | Has Metadata |
|---|---|---|---|
| Edges | 6 | 200 Bytes | |
| Individuals | 0 | 24 Bytes | |
| Migrations | 0 | 8 Bytes | |
| Mutations | 1 | 53 Bytes | |
| Nodes | 7 | 204 Bytes | |
| Populations | 0 | 8 Bytes | |
| Provenances | 1 | 509 Bytes | |
| Sites | 1 | 41 Bytes |
| Provenance Timestamp | Software Name | Version | Command | Full record |
|---|---|---|---|---|
| 28 November, 2025 at 11:33:04 AM | tskit | 1.0.0 | generate_balanced |
Detailsdictschema_version: 1.0.0
software:
dictname: tskitversion: 1.0.0
parameters:
dictcommand: generate_balancedTODO: add parameters
environment:
dict
os:
dictsystem: Linuxnode: runnervmg1sw1 release: 6.11.0-1018-azure version: #18~24.04.1-Ubuntu SMP Sat Jun 28 04:46:03 UTC 2025 machine: x86_64
python:
dictimplementation: CPythonversion: 3.11.13
libraries:
dict
kastore:
dictversion: 2.1.1 |
##fileformat=VCFv4.2
##source=tskit 1.0.0
##FILTER=<ID=PASS,Description="All filters passed">
##contig=<ID=1,length=10>
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT tsk_0 tsk_1 tsk_2 tsk_3
1 3 0 A T . PASS . GT 1 0 0 0
Here we define a tree sequence consisting of a single tree, which has a variant site at position 3 and a mutation over node 0. There is no information about individuals in this tree sequence, and so we assume that each of the nodes corresponds to a single haploid individual.
Users of msprime simulations would often be interested in producing
VCFs for diploid organisms. Because of the assumptions made
by these simulations, this means arbitrarily combining the sample
nodes into pairs. This is what the ploidy option does:
ts.write_vcf(sys.stdout, ploidy=2)
##fileformat=VCFv4.2
##source=tskit 1.0.0
##FILTER=<ID=PASS,Description="All filters passed">
##contig=<ID=1,length=10>
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT tsk_0 tsk_1
1 3 0 A T . PASS . GT 1|0 0|0
Thus, the GT values for the (synthetic) diploid individual tsk_0
are generated by combining nodes 0 and 1, and tsk_1
by combining nodes 2 and 3.
Important
Software packages modelling multiploid individuals are encouraged to
use the individual table to make their assumptions explicit. Recent
versions of simulators and inference methods should all do this,
and so the ploidy argument is really only intended to support
legacy code. It is therefore an error to supply a value for ploidy
when individual information is present in a tree sequence.
With individuals#
Extending the example in the previous section, we add some individual data defining a pair of diploid sibs and their parents.
Note
We set the nodes for (e.g.) individual 2 to [1, 3] here to illustrate that nodes for a given individual are not necessarily contiguous.
tables.individuals.add_row(parents=[-1, -1])
tables.individuals.add_row(parents=[-1, -1])
tables.individuals.add_row(parents=[0, 1])
tables.individuals.add_row(parents=[0, 1])
node_individual = tables.nodes.individual
node_individual[[1, 3]] = 2
node_individual[[0, 2]] = 3
tables.nodes.individual = node_individual
display(tables.individuals)
display(tables.nodes)
ts = tables.tree_sequence()
ts.write_vcf(sys.stdout)
| id | flags | location | parents | metadata |
|---|---|---|---|---|
| 0 | 0 | -1, -1 | ||
| 1 | 0 | -1, -1 | ||
| 2 | 0 | 0, 1 | ||
| 3 | 0 | 0, 1 |
| id | time | flags | population | individual | metadata |
|---|---|---|---|---|---|
| 0 | 0 | 1 | -1 | 3 | |
| 1 | 0 | 1 | -1 | 2 | |
| 2 | 0 | 1 | -1 | 3 | |
| 3 | 0 | 1 | -1 | 2 | |
| 4 | 1 | 0 | -1 | -1 | |
| 5 | 1 | 0 | -1 | -1 | |
| 6 | 2 | 0 | -1 | -1 |
##fileformat=VCFv4.2
##source=tskit 1.0.0
##FILTER=<ID=PASS,Description="All filters passed">
##contig=<ID=1,length=10>
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT tsk_2 tsk_3
1 3 0 A T . PASS . GT 0|0 1|0
In this model we have four individuals defined, but only individuals 2 and 3 are associated with nodes (more specifically, sample nodes). Thus, we output two VCF sample individuals composed of the linked nodes.
Note
Note that the labels are tsk_0 and tsk_1 even though
the individual IDs are 2 and 3. See the
Individual names section for how to change the
these default labels.
If some individuals have no associated nodes, they are omitted from the
VCF output. By default, only nodes that are marked as samples contribute
to the VCF genotypes; to include non-sample nodes as well (e.g., internal
nodes that have been marked as individuals), set
include_non_sample_nodes=True when calling :meth:TreeSequence.write_vcf.
Note
At present, :meth:TreeSequence.write_vcf only supports sites with up to
9 distinct alleles; attempting to write a site with more than 9 alleles
will result in a :class:ValueError.
Individual names#
By default the VCF samples are given the labels tsk_0, tsk_1,
…, tsk_{N - 1}, where N is the number of individuals to
be output (see the Constructing GT values section).
We can change this default labelling using the individual_names
argument::
import sys
import msprime
ts = msprime.sim_ancestry(
samples=3, ploidy=2, sequence_length=10, random_seed=2)
ts = msprime.sim_mutations(ts, rate=0.1, random_seed=2)
ts.write_vcf(sys.stdout, individual_names=["A", "B", "C"])
##fileformat=VCFv4.2
##source=tskit 1.0.0
##FILTER=<ID=PASS,Description="All filters passed">
##contig=<ID=1,length=10>
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT A B C
1 2 0 A T,G . PASS . GT 0|0 0|0 2|2
1 3 1 C T . PASS . GT 0|0 0|1 0|0
1 4 2 G C . PASS . GT 1|0 1|1 0|0
1 6 3 C T . PASS . GT 1|0 1|1 0|0
1 9 4 A C . PASS . GT 0|0 1|0 0|0
Exporting to plink#
The default VCF sample IDs produced by tskit do not work well
with plink because it parses the individual
IDs based on a particular format, and does not allow 0 as a valid
identifier. We get an error like this:
Error: Sample ID ends with "_0", which induces an invalid IID of '0`.
This can be fixed by using the individual_names argument
to set the names to anything where the first name doesn’t end with _0.
An example implementation for diploid individuals is:
n_dip_indv = int(ts.num_samples / 2)
indv_names = [f"tsk_{i}indv" for i in range(n_dip_indv)]
with open("output.vcf", "w") as vcf_file:
ts.write_vcf(vcf_file, individual_names=indv_names)
Adding a second _ (eg: tsk_0_indv) is not recommended as
plink uses _ as the default separator for separating family
id and individual id, and two underscores will throw an error.
Modifying coordinates#
Tree sequence site positions can be floating point values, whereas VCF
requires positive integers. The position_transform argument
controls how tskit maps coordinates into VCF. Translating non-integer
positions necessarily loses precision; by default we round to the nearest
integer, so multiple sites may share the same output position.
Furthermore, tskit’s coordinate system starts at zero,
whereas the VCF standard requires positions to be positive,
and so a site at position 0 is not valid in the VCF standard.
Because VCF parsers differ, we do not do anything to account for this.
The simplest resolution of this discrepancy in convention between tskit and VCF
positions is deal with any site at position 0 as a special case (for instance,
by discarding or ignoring it).
A different interpretation of this difference between tskit’s position
and VCF’s POS field
is that they are different coordinate systems: tskit coordinates are
“distance to the right of the left end of the chromosome”,
while VCF coordinates are “which number site, counting from the left end
of the chromosome and starting at one”.
Under this interpretation, the solution is to supply an explicit
position_transform that adds 1 to the coordinate when outputting
to VCF (or, using the "legacy" option described below). However, this can
easily lead to off-by-one errors converting between the coordinate systems,
so should only be used if you really are using 0-based coordinates for your
tree sequence.
Warning
Most VCF tools cannot deal with a POS value of 0. If your tree sequence contains a site with position 0, this will likely cause an error.
Internally, the coordinates used in the VCF output are obtained by applying
the position_transform function to the array of site positions (and, for
the contig length, to the tree sequence :attr:.TreeSequence.sequence_length).
This function must return a one-dimensional array of the same length as its
input; otherwise a :class:ValueError is raised. In addition to accepting a
callable, tskit also supports the string value "legacy" here, which
selects the pre-0.2.0 behaviour used by the original VCF exporter:
positions are rounded to the nearest integer, starting at 1, and are forced
to be strictly increasing by incrementing ties.
The VCF specification does not allow positions to be 0. By default, if any
transformed position is 0, :meth:TreeSequence.write_vcf will raise a
:class:ValueError. If you wish to retain these records you can either:
set
allow_position_zero=Trueto write such sites anyway;mask the offending sites using the
site_maskargument; orchoose a
position_transformthat maps 0 to a valid positive position.
For example, to shift all coordinates by 1, we could define:
def one_based_positions(positions):
return [int(round(x)) + 1 for x in positions]
ts.write_vcf(sys.stdout, position_transform=one_based_positions)
##fileformat=VCFv4.2
##source=tskit 1.0.0
##FILTER=<ID=PASS,Description="All filters passed">
##contig=<ID=1,length=11>
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT tsk_0 tsk_1 tsk_2
1 3 0 A T,G . PASS . GT 0|0 0|0 2|2
1 4 1 C T . PASS . GT 0|0 0|1 0|0
1 5 2 G C . PASS . GT 1|0 1|1 0|0
1 7 3 C T . PASS . GT 1|0 1|1 0|0
1 10 4 A C . PASS . GT 0|0 1|0 0|0
Note
The msprime 0.x legacy API simulates using continuous coordinates. It may be simpler to update your code to use the msprime 1.0 API (which uses discrete coordinates by default) than to work out how to transform coordinates in a way that is suitable for your application.