Show code cell source
from IPython.display import display
Data export#
Variant Call Format#
Tskit supports exporting data to the standard
Variant Call Format
via the tskit vcf
command line interface command
and the TreeSequence.write_vcf()
method in the Python API.
Conversion is quite efficient, with tskit producing VCF data at several
hundred megabytes per second (for large files), which is usually as fast as
it can be written to storage or consumed by programs in a pipeline.
Tip
If we have a tree sequence file the command line interface is often the most convenient way to convert to VCF:
$ tskit vcf example.trees > example.vcf
See the Compressed output section for information on how to compress the VCF output.
For tree sequences produced by recent versions of programs such as
msprime
, SLiM
, fwdpy11
or tsinfer
, VCF output will
“do the right thing” and no further arguments are needed.
For example, here we simulate 3 diploid individuals
with mutations using msprime
, and convert to VCF.
import sys
import msprime
ts = msprime.sim_ancestry(
samples=3, ploidy=2, sequence_length=10, random_seed=2)
ts = msprime.sim_mutations(ts, rate=0.1, random_seed=2)
ts.write_vcf(sys.stdout)
##fileformat=VCFv4.2
##source=tskit 0.6.1.dev0
##FILTER=<ID=PASS,Description="All filters passed">
##contig=<ID=1,length=10>
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT tsk_0 tsk_1 tsk_2
1 2 0 A T,G . PASS . GT 0|0 0|0 2|2
1 3 1 C T . PASS . GT 0|0 0|1 0|0
1 4 2 G C . PASS . GT 1|0 1|1 0|0
1 6 3 C T . PASS . GT 1|0 1|1 0|0
1 9 4 A C . PASS . GT 0|0 1|0 0|0
In the output VCF we have 3 diploid samples
(see the Terminology section)
corresponding to samples specified in the ancestry simulation
with IDs tsk_0
, tsk_1
and tsk_2
(see the Individual names
section for how to override these default labels).
We then have a line for every row
in the site table, and
the data is derived directly from the TreeSequence.variants()
method; e.g.
for var in ts.variants():
print(var.site.position, var.site.id, var.alleles, var.genotypes, sep="\t")
2.0 0 ('A', 'T', 'G') [0 0 0 0 2 2]
3.0 1 ('C', 'T') [0 0 0 1 0 0]
4.0 2 ('G', 'C') [1 0 1 1 0 0]
6.0 3 ('C', 'T') [1 0 1 1 0 0]
9.0 4 ('A', 'C') [0 0 1 0 0 0]
We can see the POS
value is equal to the site’s position
(see the Modifying coordinates for information
on how we deal with continuous coordinates), the ID
value
is the site’s ID, and the REF
and ALT
values
are derived from the variant’s alleles
.
The GT
values for the three diploid individuals are derived from the
variant’s genotypes (see the Terminology section).
For this simulation, the diploid individuals correspond to
adjacent sample nodes in order, and we can see there is a direct
correspondence between the phased GT
values and variant’s genotypes.
See the Constructing GT values section for
more information on how this done in general and for options
to control the VCF sample and GT
values.
Important
In these examples we write the VCF data to sys.stdout
so that we can see
the output. Usually, however, you’d write to a file:
with open("output.vcf", "w") as vcf_file:
ts.write_vcf(vcf_file)
See also
See the Compressed output section for information on how to compress the output or convert to BCF.
Terminology#
There are some mismatches between the terminology for tskit and VCF.
In VCF a “sample” is a multiploid individual, but in tskit a sample
refers to a single node (monoploid genome), and an individual
consists of one or more nodes (e.g., two nodes for a diploid).
Similarly, in VCF a “genotype” refers to the observed allelic state
for a sample individual at a particular site,
whereas in tskit a genotype is the observed allelic state
for a node (see Variant.genotypes
).
See also
See the Glossary for more details on tskit’s data model and terminology.
Compressed output#
The simplest way to compress the VCF output is to use the
tskit vcf
command line interface
and pipe the output to bgzip
:
$ tskit vcf example.trees | bgzip -c > example.vcf.gz
A general way to convert VCF data to various formats is to pipe the text
produced by tskit
into bcftools
using the command
line interface:
$ tskit vcf example.trees | bcftools view -O b > example.bcf
If you need more control over the form of the output (or want to work directly in Python), the following recipe has the same effect:
import os
import subprocess
read_fd, write_fd = os.pipe()
write_pipe = os.fdopen(write_fd, "w")
with open("output.bcf", "w") as bcf_file:
proc = subprocess.Popen(
["bcftools", "view", "-O", "b"], stdin=read_fd, stdout=bcf_file
)
ts.write_vcf(write_pipe)
write_pipe.close()
os.close(read_fd)
proc.wait()
if proc.returncode != 0:
raise RuntimeError("bcftools failed with status:", proc.returncode)
The VCF output can also be compressed using the gzip
Python module:
import gzip
with gzip.open("output.vcf.gz", "wt") as f:
ts.write_vcf(f)
However, this gzipped VCF won’t be fully compatible with downstream tools such as tabix, which usually require the VCF to use the specialised bgzip format.
Masking output#
The TreeSequence.write_vcf()
method provides the
site_mask
and sample_mask
arguments to
omit or mark parts of the output as missing.
ts = msprime.sim_ancestry(
samples=3, ploidy=2, sequence_length=10, random_seed=2)
ts = msprime.sim_mutations(ts, rate=0.1, random_seed=2)
ts.tables.sites
id | position | ancestral_state | metadata |
---|---|---|---|
0 | 2 | A | |
1 | 3 | C | |
2 | 4 | G | |
3 | 6 | C | |
4 | 9 | A |
The sample_mask
argument provides a general way to mask out
parts of the output, which can be helpful when simulating missing
data. In this (contrived) example, we create a sample mask function
that marks one genotype missing in each variant in a regular
pattern:
def sample_mask(variant):
sample_mask = np.zeros(ts.num_samples, dtype=bool)
sample_mask[variant.site.id % ts.num_samples] = 1
return sample_mask
ts.write_vcf(sys.stdout, sample_mask=sample_mask)
Constructing GT values#
The core elements of the tskit data model are nodes, edges, sites and mutations. These four tables allow us to completely describe the genetic ancestry of a set of sampled monoploid genomes and their genetic variation. The individual table defines a set of individual organisms, and it can be used to define the inheritance relationships between then (the pedigree). An individual may be associated with one or more nodes, and these nodes may or may not be samples (see the Glossary for clarification of these terms). Thus, there is some complexity in how the per-individual GT values are generated, which we explain in this section.
Without individuals#
We start with an example in which there are no individuals defined (which was the default in msprime before version 1.0):
import tskit
tables = tskit.Tree.generate_balanced(4, span=10).tree_sequence.dump_tables()
tables.sites.add_row(3, ancestral_state="A")
tables.mutations.add_row(site=0, node=0, derived_state="T")
ts = tables.tree_sequence()
display(ts.draw_svg())
display(ts)
ts.write_vcf(sys.stdout)
Tree Sequence | |
---|---|
Trees | 1 |
Sequence Length | 10.0 |
Time Units | unknown |
Sample Nodes | 4 |
Total Size | 1.1 KiB |
Metadata | No Metadata |
Table | Rows | Size | Has Metadata |
---|---|---|---|
Edges | 6 | 200 Bytes | |
Individuals | 0 | 24 Bytes | |
Migrations | 0 | 8 Bytes | |
Mutations | 1 | 53 Bytes | |
Nodes | 7 | 204 Bytes | |
Populations | 0 | 8 Bytes | |
Provenances | 1 | 504 Bytes | |
Sites | 1 | 41 Bytes |
Provenance Timestamp | Software Name | Version | Command | Full record |
---|---|---|---|---|
12 December, 2024 at 01:41:31 AM | tskit | 0.6.1.dev0 | generate_balanced |
Detailsdictschema_version: 1.0.0
software:
dictname: tskitversion: 0.6.1.dev0
parameters:
dictcommand: generate_balancedTODO: add parameters
environment:
dict
os:
dictsystem: Linuxnode: fv-az1914-72 release: 6.8.0-1017-azure version: #20-Ubuntu SMP Tue Oct 22 03:43:13 UTC 2024 machine: x86_64
python:
dictimplementation: CPythonversion: 3.10.15
libraries:
dict
kastore:
dictversion: 2.1.1 |
##fileformat=VCFv4.2
##source=tskit 0.6.1.dev0
##FILTER=<ID=PASS,Description="All filters passed">
##contig=<ID=1,length=10>
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT tsk_0 tsk_1 tsk_2 tsk_3
1 3 0 A T . PASS . GT 1 0 0 0
Here we define a tree sequence consisting of a single tree, which has a variant site at position 3 and a mutation over node 0. There is no information about individuals in this tree sequence, and so we assume that each of the nodes corresponds to a single haploid individual.
Users of msprime simulations would often be interested in producing
VCFs for diploid organisms. Because of the assumptions made
by these simulations, this means arbitrarily combining the sample
nodes into pairs. This is what the ploidy
option does:
ts.write_vcf(sys.stdout, ploidy=2)
##fileformat=VCFv4.2
##source=tskit 0.6.1.dev0
##FILTER=<ID=PASS,Description="All filters passed">
##contig=<ID=1,length=10>
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT tsk_0 tsk_1
1 3 0 A T . PASS . GT 1|0 0|0
Thus, the GT
values for the (synthetic) diploid individual tsk_0
are generated by combining nodes 0 and 1, and tsk_1
by combining nodes 2 and 3.
Important
Software packages modelling multiploid individuals are encouraged to
use the individual table to make their assumptions explicit. Recent
versions of simulators and inference methods should all do this,
and so the ploidy
argument is really only intended to support
legacy code. It is therefore an error to supply a value for ploidy
when individual information is present in a tree sequence.
With individuals#
Extending the example in the previous section, we add some individual data defining a pair of diploid sibs and their parents.
Note
We set the nodes for (e.g.) individual 2 to [1, 3] here to illustrate that nodes for a given individual are not necessarily contiguous.
tables.individuals.add_row(parents=[-1, -1])
tables.individuals.add_row(parents=[-1, -1])
tables.individuals.add_row(parents=[0, 1])
tables.individuals.add_row(parents=[0, 1])
node_individual = tables.nodes.individual
node_individual[[1, 3]] = 2
node_individual[[0, 2]] = 3
tables.nodes.individual = node_individual
display(tables.individuals)
display(tables.nodes)
ts = tables.tree_sequence()
ts.write_vcf(sys.stdout)
id | flags | location | parents | metadata |
---|---|---|---|---|
0 | 0 | -1, -1 | ||
1 | 0 | -1, -1 | ||
2 | 0 | 0, 1 | ||
3 | 0 | 0, 1 |
id | flags | population | individual | time | metadata |
---|---|---|---|---|---|
0 | 1 | -1 | 3 | 0 | |
1 | 1 | -1 | 2 | 0 | |
2 | 1 | -1 | 3 | 0 | |
3 | 1 | -1 | 2 | 0 | |
4 | 0 | -1 | -1 | 1 | |
5 | 0 | -1 | -1 | 1 | |
6 | 0 | -1 | -1 | 2 |
##fileformat=VCFv4.2
##source=tskit 0.6.1.dev0
##FILTER=<ID=PASS,Description="All filters passed">
##contig=<ID=1,length=10>
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT tsk_0 tsk_1
1 3 0 A T . PASS . GT 0|0 1|0
In this model we have four individuals defined, but only individuals 2 and 3 are associated with nodes (more specifically, sample nodes). Thus, we output two VCF sample individuals composed of the linked nodes.
Note
Note that the labels are tsk_0
and tsk_1
even though
the individual IDs are 2 and 3. See the
Individual names section for how to change the
these default labels.
Individual names#
By default the VCF samples are given the labels tsk_0
, tsk_1
,
…, tsk_{N - 1}
, where N
is the number of individuals to
be output (see the Constructing GT values section).
We can change this default labelling using the individual_names
argument::
import sys
import msprime
ts = msprime.sim_ancestry(
samples=3, ploidy=2, sequence_length=10, random_seed=2)
ts = msprime.sim_mutations(ts, rate=0.1, random_seed=2)
ts.write_vcf(sys.stdout, individual_names=["A", "B", "C"])
##fileformat=VCFv4.2
##source=tskit 0.6.1.dev0
##FILTER=<ID=PASS,Description="All filters passed">
##contig=<ID=1,length=10>
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT A B C
1 2 0 A T,G . PASS . GT 0|0 0|0 2|2
1 3 1 C T . PASS . GT 0|0 0|1 0|0
1 4 2 G C . PASS . GT 1|0 1|1 0|0
1 6 3 C T . PASS . GT 1|0 1|1 0|0
1 9 4 A C . PASS . GT 0|0 1|0 0|0
Exporting to plink#
The default VCF sample IDs produced by tskit
do not work well
with plink because it parses the individual
IDs based on a particular format, and does not allow 0
as a valid
identifier. We get an error like this:
Error: Sample ID ends with "_0", which induces an invalid IID of '0`.
This can be fixed by using the individual_names
argument
to set the names to anything where the first name doesn’t end with _0
.
An example implementation for diploid individuals is:
n_dip_indv = int(ts.num_samples / 2)
indv_names = [f"tsk_{i}indv" for i in range(n_dip_indv)]
with open("output.vcf", "w") as vcf_file:
ts.write_vcf(vcf_file, individual_names=indv_names)
Adding a second _
(eg: tsk_0_indv
) is not recommended as
plink
uses _
as the default separator for separating family
id and individual id, and two underscores will throw an error.
Modifying coordinates#
Tskit supports continuous genome coordinates, but VCF only supports discrete genome positions. Thus, when we convert a tree sequence that has sites at continuous positions we must discretise the coordinates in some way.
The position_transform
argument provides a way to flexibly translate
the genomic location of sites in tskit to the appropriate value in VCF.
There are two fundamental differences in the way that tskit and VCF define
genomic coordinates. The first is that tskit uses floating point values
to encode positions, whereas VCF uses integers. Thus, if the tree sequence
contains positions at non-integral locations there is an information loss
incurred by translating to VCF. By default, we round the site positions
to the nearest integer, such that there may be several sites with the
same integer position in the output. The second difference between VCF
and tskit is that VCF is defined to be a 1-based coordinate system, whereas
tskit uses 0-based. However, how coordinates are transformed depends
on the VCF parser, and so we do not account for this change in
coordinate system by default.
Note
The msprime 0.x legacy API simulates using continuous coordinates. It may be simpler to update your code to use the msprime 1.0 API (which uses discrete coordinates by default) than to work out how to transform coordinates in a way that is suitable for your application.
Todo
Provide some examples of using position transform.