Quickstart
Tsinfer infers tree sequences from phased genetic variation data. Input data is stored in VCF Zarr (.vcz) format, and the pipeline is controlled by a TOML configuration file.
The typical workflow is:
Convert a bgzipped VCF to VCZ format
Write a TOML config describing inputs, outputs, and parameters
Run the pipeline via the tsinfer CLI
Analyse the resulting tree sequence with tskit
Preparing input data
Tsinfer reads phased genotype data from .vcz stores. If you have a
bgzipped, indexed VCF, convert it using vcf2zarr:
vcf2zarr convert mydata.vcf.gz mydata.vcz
Each site used for inference requires a known ancestral allele. If your VCF
has an AA INFO field, vcf2zarr stores it as variant_AA in the .vcz
store and you can reference it directly in the config. Alternatively, ancestral
alleles can come from a separate VCZ store. For simulated data where the REF
allele is the ancestral allele, set is_reference = true instead of specifying
a field. See the config reference for details.
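Before pointing the config at variant_AA, it can help to confirm that the array is actually present in the store. The following is a minimal sketch, assuming the .vcz store is a plain Zarr directory on local disk, where each array is a subdirectory holding a `.zarray` (Zarr v2) or `zarr.json` (Zarr v3) metadata file. The helper name `has_array` is hypothetical and is not part of tsinfer or vcf2zarr.

```python
from pathlib import Path


def has_array(store_path, array_name):
    """Return True if a Zarr directory store appears to contain the named array.

    Assumption: the store is a plain directory on local disk. Zarr v2 marks an
    array with a ".zarray" file and Zarr v3 with a "zarr.json" file.
    """
    array_dir = Path(store_path) / array_name
    return any((array_dir / meta).is_file() for meta in (".zarray", "zarr.json"))


if __name__ == "__main__":
    # "mydata.vcz" and "variant_AA" are the names used in this quickstart.
    if has_array("mydata.vcz", "variant_AA"):
        print("variant_AA found: it can be used as the ancestral state field")
    else:
        print("variant_AA not found: supply ancestral alleles another way")
```

This only inspects the directory layout. It does not check that the stored ancestral alleles are valid for inference.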
Writing the config
The TOML config tells tsinfer where to find inputs, where to write outputs, and what parameters to use. Here is a minimal example:
[[source]]
name = "mydata"
path = "mydata.vcz"
[ancestral_state]
path = "mydata.vcz"
field = "variant_AA"
[[ancestors]]
name = "ancestors"
path = "ancestors.vcz"
sources = ["mydata"]
[match]
output = "output.trees"
[match.sources.ancestors]
node_flags = 0
create_individuals = false
[match.sources.mydata]
The [[source]] block names a VCZ store. [ancestral_state] says where to
find ancestral alleles. [[ancestors]] configures ancestor generation.
[match] controls HMM matching and output. Each source that should appear in
the output needs a [match.sources.<name>] entry. The ancestors entry sets
node_flags = 0 so that ancestor nodes are not marked as samples. For the
full set of options see the config reference.
Running the pipeline
Run all steps in one command:
tsinfer run config.toml --threads 4 -v
Or run steps individually:
tsinfer infer-ancestors config.toml --threads 4 -v
tsinfer match config.toml --threads 4 -v
Validate a config before running:
tsinfer config check config.toml
See the CLI reference for all commands and options.
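When the pipeline is driven from a script, the same commands can be wrapped with subprocess. This is a minimal sketch that uses only the invocations shown above. It assumes the tsinfer executable is on PATH and exits with a non-zero status on failure; the function names `build_commands` and `run_pipeline` are hypothetical.

```python
import subprocess


def build_commands(config_path, threads=4):
    """Return the CLI invocations from this quickstart as argument lists."""
    check = ["tsinfer", "config", "check", config_path]
    run = ["tsinfer", "run", config_path, "--threads", str(threads), "-v"]
    return [check, run]


def run_pipeline(config_path, threads=4):
    """Validate the config, then run all steps, stopping at the first failure."""
    for command in build_commands(config_path, threads):
        # check=True raises CalledProcessError on a non-zero exit status.
        subprocess.run(command, check=True)
```

Calling run_pipeline("config.toml") validates the config first, so a malformed file is caught before any inference work starts.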
Inspecting the result
The output is a standard tskit tree sequence:
import tskit
ts = tskit.load("output.trees")
print(f"{ts.num_trees} trees, {ts.num_samples} samples, {ts.num_sites} sites")
ts.draw_svg(size=(600, 300), y_axis=True)
Each diploid individual in the VCZ file corresponds to an individual in the
tree sequence with two haploid sample nodes, so ts.num_samples is twice the
number of diploid individuals.
Note
Internal node times are allele frequencies, not years or generations. Use tsdate to add meaningful dates. Branch-length statistics on uncalibrated trees will raise an error.
Inference sites
Not all sites are used for inferring the genealogy. Non-inference sites are included in the final tree sequence with mutations placed by parsimony. These include:
Fixed sites — no variation between samples
Singletons — only one genome carries the derived allele
Unknown ancestral state — ancestral allele does not match any allele
Multiallelic sites — more than two alleles
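The four categories above can be expressed as a small classifier. This is an illustrative sketch of the listed criteria only, not tsinfer's actual implementation. It assumes genotypes are given as one allele index per haploid genome with no missing data, and that "more than two alleles" refers to the alleles listed for the site. The function name `classify_site` is hypothetical.

```python
def classify_site(alleles, genotypes, ancestral_allele):
    """Return "inference" or the reason a site is treated as non-inference.

    alleles: list of allele strings for the site, e.g. ["A", "T"].
    genotypes: one allele index per haploid genome (no missing data assumed).
    ancestral_allele: the ancestral allele string for the site.
    """
    if ancestral_allele not in alleles:
        return "unknown ancestral state"
    if len(alleles) > 2:
        return "multiallelic"
    ancestral_index = alleles.index(ancestral_allele)
    derived_count = sum(1 for g in genotypes if g != ancestral_index)
    if derived_count == 0 or derived_count == len(genotypes):
        return "fixed"
    if derived_count == 1:
        return "singleton"
    return "inference"


examples = [
    (["A", "T"], [0, 0, 1, 1], "A"),
    (["A", "T"], [0, 0, 0, 0], "A"),
    (["A", "T"], [0, 0, 0, 1], "A"),
    (["A", "T"], [0, 1, 1, 0], "N"),
    (["A", "T", "G"], [0, 1, 2, 0], "A"),
]
for alleles, genotypes, ancestral in examples:
    print(alleles, genotypes, ancestral, classify_site(alleles, genotypes, ancestral))
```

The order of the checks is a design choice in this sketch: ancestral state is tested first because the derived-allele count is undefined without it.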