Quickstart#

Tsinfer infers tree sequences from phased genetic variation data. Input data is stored in VCF Zarr (.vcz) format, and the pipeline is controlled by a TOML configuration file.

The typical workflow is:

1. Convert a bgzipped VCF to VCZ format
2. Write a TOML config describing inputs, outputs, and parameters
3. Run the pipeline via the tsinfer CLI
4. Analyse the resulting tree sequence with tskit

Preparing input data#

Tsinfer reads phased genotype data from .vcz stores. If you have a bgzipped, indexed VCF, convert it using vcf2zarr:

vcf2zarr convert mydata.vcf.gz mydata.vcz

Each site used for inference requires a known ancestral allele. If your VCF has an AA INFO field, vcf2zarr stores it as variant_AA in the .vcz store and you can reference it directly in the config. Alternatively, ancestral alleles can come from a separate VCZ store. For simulated data where the REF allele is the ancestral allele, set is_reference = true instead of specifying a field. See the config reference for details.

Writing the config#

The TOML config tells tsinfer where to find inputs, where to write outputs, and what parameters to use. Here is a minimal example:

[[source]]
name = "mydata"
path = "mydata.vcz"

[ancestral_state]
path = "mydata.vcz"
field = "variant_AA"

[[ancestors]]
name = "ancestors"
path = "ancestors.vcz"
sources = ["mydata"]

[match]
output = "output.trees"

[match.sources.ancestors]
node_flags = 0
create_individuals = false

[match.sources.mydata]

The [[source]] block names a VCZ store, and [ancestral_state] says where to find ancestral alleles. [[ancestors]] configures ancestor generation, and [match] controls HMM matching and the output file. Each source that should appear in the output needs a [match.sources.<name>] entry. The ancestors entry sets node_flags = 0 so that ancestor nodes are not marked as samples. For the full set of options see the config reference.

Running the pipeline#

Run all steps in one command:

tsinfer run config.toml --threads 4 -v

Or run steps individually:

tsinfer infer-ancestors config.toml --threads 4 -v
tsinfer match config.toml --threads 4 -v

Validate a config before running:

tsinfer config check config.toml

See the CLI reference for all commands and options.
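
If you drive the pipeline from a Python script, the same commands can be assembled and launched with the standard subprocess module. This is a minimal sketch that uses only the subcommands and flags shown above. The helper names build_command and run_steps are hypothetical, and dry_run defaults to True so that the example only prints the commands.

```python
import shlex
import subprocess

# Subcommands taken from the examples in this section.
STEPS = ("run", "infer-ancestors", "match")


def build_command(step, config_path, threads=4, verbose=True):
    """Hypothetical helper: build the argument list for one tsinfer step."""
    if step not in STEPS:
        raise ValueError(f"unknown step: {step!r}")
    cmd = ["tsinfer", step, config_path, "--threads", str(threads)]
    if verbose:
        cmd.append("-v")
    return cmd


def run_steps(config_path, threads=4, dry_run=True):
    """Run ancestor inference, then matching, stopping at the first failure."""
    commands = [
        build_command("infer-ancestors", config_path, threads),
        build_command("match", config_path, threads),
    ]
    for cmd in commands:
        print(shlex.join(cmd))
        if not dry_run:
            # check=True raises CalledProcessError on a non-zero exit status.
            subprocess.run(cmd, check=True)
    return commands


commands = run_steps("config.toml")
```

Passing dry_run=False executes the commands, which requires the tsinfer CLI to be on your PATH.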

Inspecting the result#

The output is a standard tskit tree sequence:

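import tskit

# Load the tree sequence written to the path given by [match] output.
ts = tskit.load("output.trees")
print(f"{ts.num_trees} trees, {ts.num_samples} samples, {ts.num_sites} sites")
# draw_svg returns the SVG as a string; in a Jupyter notebook it renders inline.
ts.draw_svg(size=(600, 300), y_axis=True)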

Each diploid individual in the VCZ file corresponds to an individual in the tree sequence with two haploid sample nodes, so ts.num_samples is twice the number of diploid individuals.

Note

Internal node times are allele frequencies, not years or generations. Use tsdate to add meaningful dates. Until the tree sequence has been dated, computing branch-length statistics on it will raise an error.
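
One way to guard against that error is to pick the statistic mode from the tree sequence's time units. The sketch below assumes that the inferred tree sequence reports time_units equal to "uncalibrated" (the value of tskit.TIME_UNITS_UNCALIBRATED). The helper choose_stat_mode is hypothetical, and the stand-in objects exist only so the example runs without any input files.

```python
from types import SimpleNamespace

# Assumed to match tskit.TIME_UNITS_UNCALIBRATED.
UNCALIBRATED = "uncalibrated"


def choose_stat_mode(ts):
    """Hypothetical helper: fall back to site statistics on uncalibrated trees."""
    if getattr(ts, "time_units", None) == UNCALIBRATED:
        return "site"
    return "branch"


# Stand-ins for real tree sequences; only the time_units attribute is used.
fake_inferred = SimpleNamespace(time_units="uncalibrated")
fake_dated = SimpleNamespace(time_units="generations")

print(choose_stat_mode(fake_inferred), choose_stat_mode(fake_dated))
```

With a real tree sequence this could be used as ts.diversity(mode=choose_stat_mode(ts)).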

Inference sites#

Not all sites are used for inferring the genealogy. Non-inference sites are included in the final tree sequence with mutations placed by parsimony. These include:

Fixed sites: no variation between samples
Singletons: only one genome carries the derived allele
Unknown ancestral state: the ancestral allele does not match any allele at the site
Multiallelic sites: more than two alleles
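
The categories above can be illustrated with a small classifier. This is a simplified sketch and not tsinfer's actual implementation. The function classify_site, its labels, the order of the checks, and the use of -1 for missing calls are all assumptions made for the example.

```python
def classify_site(alleles, genotypes, ancestral_allele):
    """Hypothetical helper: label one site using the categories listed above.

    alleles          -- list of allele strings at the site, e.g. ["A", "T"]
    genotypes        -- one allele index per haploid genome, -1 for missing
    ancestral_allele -- the ancestral allele string for the site
    """
    if ancestral_allele not in alleles:
        return "unknown_ancestral_state"
    if len(alleles) > 2:
        return "multiallelic"
    ancestral_index = alleles.index(ancestral_allele)
    # Ignore missing calls when counting.
    called = [g for g in genotypes if g >= 0]
    derived_count = sum(1 for g in called if g != ancestral_index)
    # No variation: every called genome is ancestral, or every one is derived.
    if derived_count == 0 or derived_count == len(called):
        return "fixed"
    if derived_count == 1:
        return "singleton"
    return "inference"


examples = {
    "two derived copies": classify_site(["A", "T"], [0, 0, 1, 1], "A"),
    "one derived copy": classify_site(["A", "T"], [0, 0, 0, 1], "A"),
    "no variation": classify_site(["A", "T"], [0, 0, 0, 0], "A"),
    "ancestral not listed": classify_site(["A", "T"], [0, 1, 1, 0], "N"),
    "three alleles": classify_site(["A", "T", "G"], [0, 1, 2, 0], "A"),
}
for description, label in examples.items():
    print(f"{description}: {label}")
```

Only sites labelled "inference" in this sketch would be used to infer the genealogy. The rest correspond to the non-inference categories, which are still included in the final tree sequence with mutations placed by parsimony.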