Usage#

Toy example#

Tsinfer takes as input a Zarr file, with phased variant data encoded in the VCF Zarr (.vcz) format. For simplicity, we first demonstrate using a pre-generated .vcz file (later, in the Data example section, we describe how to create such a file e.g. from a VCF using vcf2zarr).

import zarr
vcf_zarr = zarr.open("_static/example_data.vcz")

Here’s what the genotypes stored in that datafile look like:

Diploid sample id:     0           1           2      
Genome for sample:   0    1      0    1      0    1    
------------------------------------------------------
      position 16:   T    T      T    G      T    G      
      position 44:   G    G      G    G      T    T      
      position 50:   G    G      G    C      C    C      
      position 55:   A    T      T    T      A    T      
      position 71:   C    C      C    C      C    C      
      position 75:   T    T      T    A      T    T      
      position 85:   C    G      G    C      C    C      
      position 95:   AGGT AGGT   A    AGGT   AGGT AGGT   

Note

The last site, at position 95, is an indel (insertion or deletion). Indels can be used for inference as long as the indel does not overlap with other variants, only two alleles exist, and the ancestral state is known.

VariantData#

Tsinfer produces a genealogy that could have given rise to the data set above, based on the sites that vary between the samples. To provide extra information to the algorithm, you must wrap the .vcz file in a lightweight tsinfer.VariantData object, using syntax like:

vdata = tsinfer.VariantData("file.vcz", ancestral_state=***, ...)

Ancestral states#

Importantly, the VariantData object requires an ancestral state to be provided for each site used in inference. There are many methods for determining ancestral states: details are outside the scope of this manual, but we have started a discussion topic on this issue to provide some recommendations.

Ancestral states can be specified in several ways:

  • Using an existing array in the .vcz file. Some VCF files already contain an ancestral state, e.g. in the AA (“ancestral allele”) info field. In this case the variant_AA field of the .vcz file can be specified, as follows.

    vdata = tsinfer.VariantData("file.vcz", ancestral_state="variant_AA")
    

    A worked example is shown in the real data example later on this page.

  • Using a numpy array of ancestral state strings, of the same length as the number of unmasked variants. For example, if you wish to treat the reference (“REF”) allele as the ancestral state, you can take advantage of the fact that the .vcz format always stores the zeroth allele as the REF, suggesting the following syntax:

    vcf_zarr = zarr.load("file.vcz")
    vdata = tsinfer.VariantData("file.vcz", ancestral_state=vcf_zarr["variant_allele"][:, 0])
    

    A worked example of using an array of strings is provided in the simulation example later on this page.

  • Using a single string of ancestral states, e.g. from a FASTA file. The string should cover the entire genetic sequence, such that the ith character in the string is taken as the ancestral state for an inference site at position i. In this case, the add_ancestral_state_array() function can be used to extract the states and save them to the VCF Zarr dataset, under the name ancestral_state. Note that if, as is common, variant positions in the .vcz file are one-based (starting at 1), rather than zero-based, you should add a padding character at the start of the string.

Below we illustrate the single string method, using a stored FASTA file. In this file, the 16th, 44th, 50th, 55th, 71st, 75th, 85th, and 95th characters are G, G, C, T, C, A, T, and A (note that the 85th character, T, does not match any of the alleles in the .vcz genotypes for position 85).

import tsinfer
import zarr
import pyfaidx

vcf_zarr = zarr.open("_static/example_data.vcz")

reader = pyfaidx.Fasta("_static/example_ancestral_state.fa")
ancestral_str = str(reader["chr1"])
# We consider positions in the .vcz file to be one-based, so we prepend an X to the string
ancestral_str = "X" + ancestral_str

tsinfer.add_ancestral_state_array(vcf_zarr, ancestral_str) 
vdata = tsinfer.VariantData("_static/example_data.vcz", ancestral_state="ancestral_state")
/home/runner/work/tskit-site/tskit-site/tsinfer/formats.py:2587: UserWarning: An ancestral allele was not found in the variant_allele array for the 1 sites (12.50%) listed below. They will be treated as of unknown ancestral state:
 'T': 1 (12.50% of sites)
  warnings.warn(

Because the ancestral state at position 85 does not match any of the alleles at that site, a warning has been given that the ancestral state will be considered unknown. As a consequence, although it will appear in the inferred tree sequence, the site at position 85 will be treated as a “noninference” site.

Inference sites#

Certain sites are not used by tsinfer for inferring the genealogy. These noninference sites are nevertheless included in the final tree sequence, but with their mutations placed by parsimony. Such sites include:

  • Non-variable (fixed) sites, e.g. the site at position 71 above

  • Singleton sites, where only one genome has the derived allele, e.g. the site at position 75 above

  • Sites where the ancestral allele is unknown, e.g. site 85 (see above).

  • Multiallelic sites, with more than 2 alleles (but see here for a workaround)

Additional sites can be deliberately flagged as not-for-use in inference, for example if their genotypes or ancestral states are deemed unreliable, via the exclude_positions parameter when running the inference step of tsinfer.
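
For instance, the following is a minimal sketch using the toy vdata object created above. It assumes that exclude_positions can be passed directly to tsinfer.infer() (it is documented as a parameter of the inference step; if your tsinfer version only exposes it on tsinfer.generate_ancestors(), pass it there instead):

# Hedged sketch: deliberately exclude the site at position 55 from inference.
# It will still appear in the output tree sequence, but its mutations will be
# placed by parsimony rather than used to build ancestors.
excluded_ts = tsinfer.infer(vdata, exclude_positions=[55])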

Topology inference#

Once our data is wrapped in a VariantData object, we can infer a tree sequence e.g. using tsinfer’s Python API. Note that each sample in the original .vcz file will correspond to an individual in the resulting tree sequence. Since these three individuals are diploid, the resulting tree sequence will have ts.num_samples == 6 (unlike in a .vcz file, a “sample” in tskit refers to a haploid genome, not a diploid individual).

inferred_ts = tsinfer.infer(vdata)
print(f"Inferred a genetic genealogy for {inferred_ts.num_samples} (haploid) genomes")
Inferred a genetic genealogy for 6 (haploid) genomes

And that’s it: we now have a fully functional tskit.TreeSequence object that we can interrogate in the usual ways. For example, we can look at the variants in the tree sequence:

print("TS sample", "\t".join(str(s) for s in inferred_ts.samples()), sep="\t")
print("TS individual", "\t".join(str(inferred_ts.node(s).individual) for s in inferred_ts.samples()), sep="\t")
print("-" * 60)
print("Pos (anc state)")
for v in inferred_ts.variants():
    print(
        f"{int(v.site.position)}   ({v.site.ancestral_state})",
        "\t".join(f"{a:<4}" for a in np.array(v.alleles)[v.genotypes]),
        sep="\t")
TS sample	0	1	2	3	4	5
TS individual	0	0	1	1	2	2
------------------------------------------------------------
Pos (anc state)
16   (G)	T   	T   	T   	G   	T   	G   
44   (G)	G   	G   	G   	G   	T   	T   
50   (C)	G   	G   	G   	C   	C   	C   
55   (T)	A   	T   	T   	T   	A   	T   
71   (C)	C   	C   	C   	C   	C   	C   
75   (A)	T   	T   	T   	A   	T   	T   
85   (C)	C   	G   	G   	C   	C   	C   
95   (A)	AGGT	AGGT	A   	AGGT	AGGT	AGGT

Comparing the genotypes of each individual in the inferred tree sequence against the equivalent diploid samples in the .vcz file, you can see that they are identical. Apart from the imputation of missing data, tsinfer is guaranteed to losslessly encode any genetic variation data, regardless of the inferred topology. You can check this programmatically if you want:

import numpy as np
for v_orig, v_inferred in zip(vdata.variants(), inferred_ts.variants()):
    if any(
        np.array(v_orig.alleles)[v_orig.genotypes] !=
        np.array(v_inferred.alleles)[v_inferred.genotypes]
    ):
        raise ValueError("Genotypes in original dataset and inferred tree seq not equal")
print("** Genotypes in original dataset and inferred ts are identical **")
** Genotypes in original dataset and inferred ts are identical **

We can examine the inferred genetic genealogy, in the form of local trees. Tsinfer has placed mutations on the genealogy to explain the observed genetic variation:

mut_labels = {
    m.id: "{:g}: {}{}".format(
        s.position,
        inferred_ts.mutation(m.parent).derived_state if m.parent >= 0 else s.ancestral_state,
        m.derived_state,
    )
    for s in inferred_ts.sites()
    for m in s.mutations
}
node_labels = {u: u for u in inferred_ts.samples()}

inferred_ts.draw_svg(
    size=(600, 300),
    canvas_size=(640, 300),
    node_labels=node_labels,
    mutation_labels=mut_labels,
    y_axis=True
)
[SVG figure: the four inferred local trees, with mutations labelled by position and allele change]

We have inferred 4 trees along the genome. Note that there are “polytomies” in the trees, where some nodes have more than two children (e.g. the root node in the first tree). This is a common feature of trees inferred by tsinfer and signals that there was not sufficient information to resolve the tree at this internal node. By default, the ancestry in the flanking regions (to the left of the first provided site and the right of the last provided site) is considered unknown, and hence not drawn above.

Each internal (non-sample) node in this inferred genealogy represents an ancestral sequence, constructed on the basis of shared, derived alleles at one or more of the sites. By default, the time of each such node is not measured in years or generations, but is simply the frequency of the shared derived allele(s) on which the ancestral sequence is based. For this reason, the time units are described as “uncalibrated” in the plot, and trying to calculate statistics based on branch lengths will raise an error:

inferred_ts.diversity(mode="branch")
---------------------------------------------------------------------------
LibraryError                              Traceback (most recent call last)
/tmp/ipykernel_2993/2935518142.py in ?()
----> 1 inferred_ts.diversity(mode="branch")

/opt/hostedtoolcache/Python/3.10.16/x64/lib/python3.10/site-packages/tskit/trees.py in ?(self, sample_sets, windows, mode, span_normalise)
   8106             window (defaults to True).
   8107         :return: A numpy array whose length is equal to the number of sample sets.
   8108             If there is one sample set and windows=None, a numpy scalar is returned.
   8109         """
-> 8110         return self.__one_way_sample_set_stat(
   8111             self._ll_tree_sequence.diversity,
   8112             sample_sets,
   8113             windows=windows,

/opt/hostedtoolcache/Python/3.10.16/x64/lib/python3.10/site-packages/tskit/trees.py in ?(self, ll_method, sample_sets, windows, mode, span_normalise, polarised)
   7813         if np.any(sample_set_sizes == 0):
   7814             raise ValueError("Sample sets must contain at least one element")
   7815 
   7816         flattened = util.safe_np_int_cast(np.hstack(sample_sets), np.int32)
-> 7817         stat = self.__run_windowed_stat(
   7818             windows,
   7819             ll_method,
   7820             sample_set_sizes,

/opt/hostedtoolcache/Python/3.10.16/x64/lib/python3.10/site-packages/tskit/trees.py in ?(self, windows, method, *args, **kwargs)
   7776     def __run_windowed_stat(self, windows, method, *args, **kwargs):
   7777         strip_dim = windows is None
   7778         windows = self.parse_windows(windows)
-> 7779         stat = method(*args, **kwargs, windows=windows)
   7780         if strip_dim:
   7781             stat = stat[0]
   7782         return stat

LibraryError: Statistics using branch lengths cannot be calculated when time_units is 'uncalibrated'. (TSK_ERR_TIME_UNCALIBRATED)

To add meaningful dates to an inferred tree sequence, allowing branch length statistics to be calculated, you must use additional software such as tsdate: the tsinfer algorithm is only intended to infer the genetic relationships between the samples (i.e. the topology of the tree sequence).
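
As a rough, hedged sketch (not part of this tutorial's pipeline), dating the toy inferred tree sequence with tsdate might look something like the following. The mutation rate below is an arbitrary placeholder, and the exact tsdate.date() signature varies between tsdate versions, so check the tsdate documentation before relying on it:

import tsdate

# Prepare the tsinfer output for dating, then assign calibrated node times.
# mutation_rate=1e-8 is a placeholder value, not estimated from these data.
prepared_ts = tsdate.preprocess_ts(inferred_ts)
dated_ts = tsdate.date(prepared_ts, mutation_rate=1e-8)
print(dated_ts.time_units)  # no longer "uncalibrated"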

Masks#

As well as marking sites as not-for-inference, it is possible to exclude sites and samples entirely, by specifying a boolean site_mask and/or a sample_mask when creating the VariantData object. Sites or samples with a mask value of True will be completely omitted both from inference and from the final tree sequence. This can be useful, for example, if you wish to select only a subset of the chromosome for inference, e.g. to reduce computational load. You can also use it to subset inference to a particular contig, if your dataset contains multiple contigs. Note that if a site_mask is provided, the ancestral states array should only specify alleles for the unmasked sites.

Below, for instance, is an example that includes only the sites at positions less than 80 on the contig labelled "chr1" in the example_data.vcz file (i.e. the first six sites):

import numpy as np
import zarr

vcf_zarr = zarr.open("_static/example_data.vcz")

# mask out any sites not associated with the contig named "chr1"
# (for demonstration: all sites in this .vcz file are from "chr1" anyway)
chr1_index = np.where(vcf_zarr.contig_id[:] == "chr1")[0]
site_mask = vcf_zarr.variant_contig[:] != chr1_index
# also mask out any sites with a position >= 80
site_mask[vcf_zarr.variant_position[:] >= 80] = True

smaller_vdata = tsinfer.VariantData(
    "_static/example_data.vcz",
    ancestral_state="ancestral_state",
    site_mask=site_mask,
)
print(f"The `smaller_vdata` object returns data for only {smaller_vdata.num_sites} sites")
The `smaller_vdata` object returns data for only 6 sites

Simulation example#

The previous example showed how we can infer a tree sequence using the Python API for a trivial toy example. However, for real data we will not prepare our data and infer the tree sequence all in one go; rather, we will usually split the process into at least two distinct steps.

The first step in any inference is to prepare your data and create the .vcz file. For simplicity here we’ll use Python to simulate some data under the coalescent with recombination, using msprime:

import builtins
import sys
import os
import subprocess

from Bio import bgzf
import numpy as np

import msprime
import tsinfer

if getattr(builtins, "__IPYTHON__", False):  # if running IPython: e.g. in a notebook
    num_diploids, seq_len = 100, 10_000
    name = "notebook-simulation"
    python = sys.executable
else:  # Take parameters from the command-line
    num_diploids, seq_len = int(sys.argv[1]), float(sys.argv[2])
    name = "cli-simulation"
    python = "python"

ts = msprime.sim_ancestry(
    num_diploids,
    population_size=10**4,
    recombination_rate=1e-8,
    sequence_length=seq_len,
    random_seed=6,
)
ts = msprime.sim_mutations(ts, rate=1e-8, random_seed=7)
ts.dump(name + "-source.trees")
print(
    f"Simulated {ts.num_samples} samples over {seq_len/1e6} Mb:",
    f"{ts.num_trees} trees and {ts.num_sites} sites"
)

# Convert to a zarr file: this should be easier once a tskit2zarr utility is made, see
# https://github.com/sgkit-dev/bio2zarr/issues/232
np.save(f"{name}-AA.npy", [s.ancestral_state for s in ts.sites()])
vcf_name = f"{name}.vcf.gz"
with bgzf.open(vcf_name, "wt") as f:
    ts.write_vcf(f)
subprocess.run(["tabix", vcf_name])
ret = subprocess.run(
    [python, "-m", "bio2zarr", "vcf2zarr", "convert", "--force", vcf_name, f"{name}.vcz"],
    stderr = subprocess.DEVNULL if name == "notebook-simulation" else None,
)
assert os.path.exists(f"{name}.vcz")

if ret.returncode == 0:
    print(f"Converted to {name}.vcz")
Simulated 200 samples over 0.01 Mb: 20 trees and 20 sites
Converted to notebook-simulation.vcz

Here we first run a simulation, then create a VCF file and convert it to .vcz format. If we store this code in a file named simulate-data.py, we can run it on the command line, providing parameters to generate much bigger datasets:

% python simulate-data.py 5000 10_000_000
Simulated 10000 samples over 10.0 Mb: 36204 trees and 38988 sites
    Scan: 100%|████████████████████████████████████████████████████████████████████| 1.00/1.00 [00:00<00:00, 2.86files/s]
 Explode: 100%|████████████████████████████████████████████████████████████████████| 39.0k/39.0k [01:17<00:00, 501vars/s]
  Encode: 100%|███████████████████████████████████████████████████████████████████████| 976M/976M [00:12<00:00, 78.0MB/s]
Finalise: 100%|█████████████████████████████████████████████████████████████████████| 10.0/10.0 [00:00<00:00, 421array/s]

Here, we simulated a sample of 10 thousand chromosomes, each 10Mb long, under human-like parameters, which created about 39 thousand segregating sites. The output above shows the state of the vcf-to-zarr conversion at the end of this process, indicating that it took about a minute to convert the data into .vcz format (this only needs doing once).

Examining the files on the command line, we then see the following:

$ du -sh cli-simulation*
156K	cli-simulation-AA.npy
8.8M	cli-simulation-source.trees
 25M	cli-simulation.vcf.gz
8.0K	cli-simulation.vcf.gz.tbi
9.4M	cli-simulation.vcz

The cli-simulation.vcz zarr data is quite small, about the same size as the original msprime tree sequence file. Since (like all zarr datastores) cli-simulation.vcz is a directory, we can peer inside to see where everything is stored:

$ du -sh cli-simulation.vcz/*
8.7M	cli-simulation.vcz/call_genotype
 88K	cli-simulation.vcz/call_genotype_mask
 88K	cli-simulation.vcz/call_genotype_phased
 12K	cli-simulation.vcz/contig_id
 12K	cli-simulation.vcz/contig_length
 12K	cli-simulation.vcz/filter_id
 28K	cli-simulation.vcz/sample_id
 52K	cli-simulation.vcz/variant_allele
 24K	cli-simulation.vcz/variant_contig
 24K	cli-simulation.vcz/variant_filter
 48K	cli-simulation.vcz/variant_id
 24K	cli-simulation.vcz/variant_id_mask
144K	cli-simulation.vcz/variant_position
 24K	cli-simulation.vcz/variant_quality

Most of this output is not particularly interesting here, but we can see that the call_genotype array, which holds all of the sample genotypes (and thus the vast bulk of the actual data), only requires about 8.7MB compared to 25MB for the compressed VCF file. In practice this means we can keep such files lying around without taking up too much space.
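
If you prefer to check this from Python rather than with du, something like the following sketch should work; it assumes the zarr Python API, where an array's info report includes its on-disk storage size:

import zarr

root = zarr.open("cli-simulation.vcz")
genotypes = root["call_genotype"]
# The info report includes the array shape, chunking, compressor and stored size
print(genotypes.info)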

Once we have our .vcz file created, running the inference is straightforward.

# Infer & save a ts from the notebook simulation.
ancestral_states = np.load(f"{name}-AA.npy")
vdata = tsinfer.VariantData(f"{name}.vcz", ancestral_states)
tsinfer.infer(vdata, progress_monitor=True, num_threads=4).dump(name + ".trees")

The infer command runs the full inference pipeline in one go (the individual steps are explained here). We provided two extra arguments to infer: progress_monitor displays progress bars while the inference runs, and num_threads=4 tells tsinfer to use four worker threads whenever it can use them.

Performing this inference on the cli-simulation data using a Core i5 processor (launched 2009) with 4GiB of RAM took about eighteen minutes. The maximum memory usage was about 600MiB.

The output files look like this:

$ ls -lh cli-simulation*
-rw-r--r--  1 user  group   8.2M 14 Oct 13:45 cli-simulation-source.trees
-rw-r--r--  1 user  group    22M 14 Oct 13:45 cli-simulation.samples
-rw-r--r--  1 user  group   8.1M 14 Oct 13:50 cli-simulation.trees

Therefore the output tree sequence file that we have just inferred is even smaller than the original msprime simulated tree sequence! Because the output file is also a tskit.TreeSequence, we can use the same API to work with both. For example, to load up the original and inferred tree sequences from their corresponding .trees files you can simply use the tskit.load() function, as shown below.

However, there are thousands of trees in these tree sequences, each of which has 10 thousand samples: much too much to easily visualise. Therefore, in the code below, we subset both tree sequences down to their minimal representations for the first six sample nodes using tskit.TreeSequence.simplify(), while keeping all sites, even if they do not show genetic variation between the six selected samples.

Note

Using this tiny subset of the overall data allows us to get an informal feel for the trees that are inferred by tsinfer, but this is certainly not a recommended approach for validating the inference!

Once we’ve subsetted the tree sequences down to something that we can comfortably look at, we can plot the trees in particular regions of interest, for example the first 2kb of genome. Below, for simplicity we use the smaller dataset that was simulated and inferred within the notebook, but exactly the same code will work with the larger version run from the cli simulation.

import tskit

subset = range(0, 6)  # show first 6 samples
limit = (0, 2_000)    # show only the trees covering the first 2kb

prefix = "notebook"  # Or use "cli" for the larger example
source = tskit.load(prefix + "-simulation-source.trees")
source_subset = source.simplify(subset, filter_sites=False)
print(f"True tree seq, simplified to {len(subset)} sampled genomes")
source_subset.draw_svg(size=(800, 200), x_lim=limit, time_scale="rank")
True tree seq, simplified to 6 sampled genomes
[SVG figure: the true (simulated) trees over the first 2kb, simplified to 6 sampled genomes]
inferred = tskit.load(prefix + "-simulation.trees")
inferred_subset = inferred.simplify(subset, filter_sites=False)
print(f"Inferred tree seq, simplified to {len(subset)} sampled genomes")
inferred_subset.draw_svg(size=(800, 200), x_lim=limit)
Inferred tree seq, simplified to 6 sampled genomes
[SVG figure: the inferred trees over the first 2kb, simplified to 6 sampled genomes]

There are a number of things to note when comparing the plots above. Most obviously, the first tree in the inferred tree sequence is empty: by default, tsinfer will not generate a genealogy before the first site in the genome (here, position 655) or past the last site, as there is, by definition, no data in these regions.

You can also see that the inferred tree sequence has fewer trees than the original, simulated one. There are two reasons for this. First, there are some tree changes that do not involve a change of topology (e.g. the first two trees in the original tree sequence are topologically identical, as are two trees after that). Even if we had different mutations on the different edges in these trees, such tree changes are unlikely to be spotted. Second, our inference depends on the mutational information that is present. If no mutations fall on a particular edge in the tree sequence, then we have no way of inferring that this edge existed. The fact that there are no mutations in the first, third, and fifth trees shows that tree changes can occur without associated mutations. As a result, there will be tree transitions that we cannot pick up. In the simulation that we performed, the mutation rate is equal to the recombination rate, and so we expect that many recombinations will be invisible to us.

For similar reasons, there will be many nodes in the tree at which polytomies occur. Here we correctly infer that sample nodes 0 and 4 group together, as do 1 and 2. However, in the middle inferred tree we were not able to distinguish whether node 3 was closer to node 1 or 2, so we have three children of node 7 in that tree.

Finally, you should note that inference does not always get the right answer. At the first site in the inferred trees, node 5 is placed as an outgroup, but at this site it is actually closer to the group consisting of nodes 0 and 4.

Note

Other than the sample node IDs, it is meaningless to compare node numbers in the source and inferred tree sequences.

Data example#

Inputting real data for inference is similar in principle to the previous examples. All that is required is a .vcz file, which can be created using vcf2zarr as above.

Reading a VCF#

For example data, we use a publicly available VCF file of the genetic variants from chromosome 24 of ten Norwegian and French house sparrows, Passer domesticus (thanks to Mark Ravinet for the data file):

import zarr

vcf_location = "_static/P_dom_chr24_phased.vcf.gz"
!python -m bio2zarr vcf2zarr convert --force {vcf_location} sparrows.vcz

This creates the sparrows.vcz datastore, which we open using tsinfer.VariantData. The original VCF had the ancestral allelic state specified in the AA INFO field, so we can simply provide the string "variant_AA" as the ancestral_state parameter.

# Do the inference: this VCF has ancestral states in the AA field
vdata = tsinfer.VariantData("sparrows.vcz", ancestral_state="variant_AA")
ts = tsinfer.infer(vdata)
print(
    "Inferred tree sequence: {} trees over {} Mb ({} edges)".format(
        ts.num_trees, ts.sequence_length / 1e6, ts.num_edges
    )
)
Inferred tree sequence: 6861 trees over 7.077728 Mb (37741 edges)

On a modern computer, this should only take a few seconds to run.

Adding more metadata#

We can add additional data to the zarr file, which will make it through to the tree sequence. For instance, we might want to mark which population each individual comes from. This can be done by adding some descriptive metadata for each population, and then assigning each sample to one of those populations. In our case, the sample sparrow IDs beginning with "FR" are from France:

import json
import numpy as np
import tskit
import zarr

vcf_zarr = zarr.load("sparrows.vcz")

populations = ("Norway", "France")
# save the population data in json format
schema = json.dumps(tskit.MetadataSchema.permissive_json().schema).encode()
zarr.save("sparrows.vcz/populations_metadata_schema", schema)
metadata = [
    json.dumps({"name": pop, "description": "The country from which this sample comes"}).encode()
    for pop in populations
]
zarr.save("sparrows.vcz/populations_metadata", metadata)

# Now assign each diploid sample to a population
num_individuals = vcf_zarr["sample_id"].shape[0]
individuals_population = np.full(num_individuals, tskit.NULL, dtype=np.int32)
for i, name in enumerate(vcf_zarr["sample_id"]):
    if name.startswith("FR"):
        individuals_population[i] = populations.index("France")
    else:
        individuals_population[i] = populations.index("Norway")
zarr.save("sparrows.vcz/individuals_population", individuals_population)

Tsinfer inference#

Note that the steps above to generate a .vcz file are not strictly part of tsinfer. We only invoke tsinfer subsequently, when creating a VariantData object. Moreover, tsinfer treats the .vcz information as read-only, and does not make a copy of it. This means tsinfer is well-suited to using publicly provided, read-only .vcz datafiles. Furthermore, Masks make it easy to use subsets of such datafiles, e.g. if they contain multiple chromosomes or more samples than are required for your analysis.
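
For example, here is a hedged sketch of restricting inference to the Norwegian birds only, by masking out every sample whose ID begins with "FR" (the French birds), mirroring the site_mask example shown earlier:

import numpy as np
import zarr
import tsinfer

vcf_zarr = zarr.load("sparrows.vcz")
# True means "omit this sample from inference and from the final tree sequence"
sample_mask = np.array([s.startswith("FR") for s in vcf_zarr["sample_id"]])
print(f"Keeping {np.sum(~sample_mask)} of {sample_mask.size} sparrows (the Norwegian ones)")

norway_vdata = tsinfer.VariantData(
    "sparrows.vcz", ancestral_state="variant_AA", sample_mask=sample_mask
)
norway_ts = tsinfer.infer(norway_vdata)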

As the .vcz file we are now using contains population metadata, tsinfer will create a tree sequence whose sample nodes are correctly assigned to named populations:

vdata = tsinfer.VariantData("sparrows.vcz", ancestral_state="variant_AA", individuals_population="individuals_population")
sparrow_ts = tsinfer.infer(vdata)

for sample_node_id in sparrow_ts.samples():
    individual_id = sparrow_ts.node(sample_node_id).individual
    population_id = sparrow_ts.node(sample_node_id).population
    print(
        "Node",
        sample_node_id,
        "labels a chr24 sampled from individual",
        sparrow_ts.individual(individual_id).metadata["variant_data_sample_id"],
        "in",
        sparrow_ts.population(population_id).metadata["name"],
    )
Node 0 labels a chr24 sampled from individual 8934547 in Norway
Node 1 labels a chr24 sampled from individual 8934547 in Norway
Node 2 labels a chr24 sampled from individual 8L19766 in Norway
Node 3 labels a chr24 sampled from individual 8L19766 in Norway
Node 4 labels a chr24 sampled from individual 8M31651 in Norway
Node 5 labels a chr24 sampled from individual 8M31651 in Norway
Node 6 labels a chr24 sampled from individual 8N05890 in Norway
Node 7 labels a chr24 sampled from individual 8N05890 in Norway
Node 8 labels a chr24 sampled from individual 8N73604 in Norway
Node 9 labels a chr24 sampled from individual 8N73604 in Norway
Node 10 labels a chr24 sampled from individual FR041 in France
Node 11 labels a chr24 sampled from individual FR041 in France
Node 12 labels a chr24 sampled from individual FR044 in France
Node 13 labels a chr24 sampled from individual FR044 in France
Node 14 labels a chr24 sampled from individual FR046 in France
Node 15 labels a chr24 sampled from individual FR046 in France
Node 16 labels a chr24 sampled from individual FR048 in France
Node 17 labels a chr24 sampled from individual FR048 in France
Node 18 labels a chr24 sampled from individual FR050 in France
Node 19 labels a chr24 sampled from individual FR050 in France

Analysis#

To analyse your inferred tree sequence you can use all the analysis functions built in to the tskit library. The tskit tutorial provides much more detail. Below we just give a flavour of the possibilities.

To quickly eyeball small datasets, we can draw the entire tree sequence, or draw() the tree at any particular genomic position. The following code demonstrates how to use the tskit.TreeSequence.at() method to obtain the tree 1Mb from the start of the sequence, and plot it, colouring the tips according to population:

colours = {"Norway": "red", "France": "blue"}
colours_for_node = {}
for n in sparrow_ts.samples():
    population_data = sparrow_ts.population(sparrow_ts.node(n).population)
    colours_for_node[n] = colours[population_data.metadata["name"]]

individual_for_node = {}
for n in sparrow_ts.samples():
    individual_data = sparrow_ts.individual(sparrow_ts.node(n).individual)
    individual_for_node[n] = individual_data.metadata["variant_data_sample_id"]

tree = sparrow_ts.at(1e6)
tree.draw(
    path="tree_at_1Mb.svg",
    height=700,
    width=1200,
    node_labels=individual_for_node,
    node_colours=colours_for_node,
)
[SVG figure: the tree at position 1Mb, with tips labelled by bird ID and coloured by country (Norway red, France blue)]

This tree seems to suggest that Norwegian and French individuals may not fall into discrete groups on the tree, but be part of a larger mixing population. Note, however, that this is only one of thousands of trees, and may not be typical of the genome as a whole. Additionally, most data sets will have far more samples than this example, so trees visualized in this way are likely to be huge and difficult to understand. As in the simulation example above, one possibility is to simplify() the tree sequence to a limited number of samples, but it is likely that most studies will instead rely on various statistical summaries of the trees. Storing genetic data as a tree sequence makes many of these calculations fast and efficient, and tskit has both a set of commonly used methods and a framework that generalizes population genetic statistics. For example, the allele or site frequency spectrum (SFS) can be calculated using tskit.TreeSequence.allele_frequency_spectrum() and the allelic diversity (“Tajima’s \({\pi}\)”) using tskit.TreeSequence.diversity(), both of which can also be calculated locally (e.g. per tree or in genomic windows). As a basic example, here’s how to calculate genome-wide \(F_{st}\) between the Norwegian and French (sub)populations:

samples_listed_by_population = [
    sparrow_ts.samples(population=pop_id)
    for pop_id in range(sparrow_ts.num_populations)
]

print("Fst between populations:", sparrow_ts.Fst(samples_listed_by_population))
Fst between populations: 0.014022798331082553
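
The other statistics mentioned above work in much the same way. As a further illustrative sketch, here is the nucleotide diversity within each population, and the diversity in five equal windows along chromosome 24 (site statistics like these are fine even though node times are uncalibrated):

import numpy as np

# Nucleotide diversity within each population (site mode is the default)
pi_per_pop = sparrow_ts.diversity(samples_listed_by_population)
print("Diversity (Norway, France):", pi_per_pop)

# Diversity over all samples, in 5 equal windows spanning the sequence
windows = np.linspace(0, sparrow_ts.sequence_length, num=6)
print("Windowed diversity:", sparrow_ts.diversity(windows=windows))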

As noted above, the times of nodes are uncalibrated so we shouldn’t perform calculations that rely on branch lengths. However, some statistics, such as the genealogical nearest neighbour (GNN) proportions, are calculated purely from the topology of the trees. Here’s an example, using the individual and population metadata to format the results table in a tidy manner:

import pandas as pd

gnn = sparrow_ts.genealogical_nearest_neighbours(
    sparrow_ts.samples(), samples_listed_by_population
)

# Tabulate GNN nicely using a Pandas dataframe with named rows and columns
sample_nodes = [sparrow_ts.node(n) for n in sparrow_ts.samples()]
sample_ids = [n.id for n in sample_nodes]
sample_names = [
    sparrow_ts.individual(n.individual).metadata["variant_data_sample_id"]
    for n in sample_nodes
]
sample_pops = [
    sparrow_ts.population(n.population).metadata["name"]
    for n in sample_nodes
]
gnn_table = pd.DataFrame(
    data=gnn,
    index=[
        pd.Index(sample_ids, name="Sample node"),
        pd.Index(sample_names, name="Bird"),
        pd.Index(sample_pops, name="Country"),
    ],
    columns=[p.metadata["name"] for p in sparrow_ts.populations()],
)

print(gnn_table)
# Summarize GNN for all birds from the same country
print(gnn_table.groupby(level="Country").mean())
                               Norway    France
Sample node Bird    Country                    
0           8934547 Norway   0.543012  0.456988
1           8934547 Norway   0.536100  0.463900
2           8L19766 Norway   0.516064  0.483936
3           8L19766 Norway   0.490565  0.509435
4           8M31651 Norway   0.534419  0.465581
5           8M31651 Norway   0.615497  0.384503
6           8N05890 Norway   0.540100  0.459900
7           8N05890 Norway   0.556888  0.443112
8           8N73604 Norway   0.554031  0.445969
9           8N73604 Norway   0.500408  0.499592
10          FR041   France   0.439949  0.560051
11          FR041   France   0.460082  0.539918
12          FR044   France   0.484670  0.515330
13          FR044   France   0.431144  0.568856
14          FR046   France   0.530776  0.469224
15          FR046   France   0.518039  0.481961
16          FR048   France   0.494857  0.505143
17          FR048   France   0.487018  0.512982
18          FR050   France   0.531259  0.468741
19          FR050   France   0.485466  0.514534
           Norway    France
Country                    
France   0.486326  0.513674
Norway   0.538708  0.461292

From this, it can be seen that the genealogical nearest neighbours of birds in Norway tend also to be in Norway, and vice versa for birds from France. In other words, there is a small but noticeable degree of population structure in the data. The bird 8L19766 and one of the chromosomes of bird FR046 seem to buck this trend, and it would probably be worth checking these data points further (perhaps they are migrant birds), and verifying whether other chromosomes in these individuals show the same pattern.

Much more can be done with the genomic statistics built into tskit. For further information, please refer to the statistics section of the tskit documentation.