Tskit for population genetics

`Tskit` for population genetics#

Tskit, the tree sequence toolkit, brings the power of evolutionary trees to the field of population genetics. The succinct tree sequence format is designed to store DNA sequences jointly with their ancestral history (the “genetic genealogy” or ARG). Storing population genetic data in this form enables highly efficient computation and analysis.

The core tskit library provides methods for storing genetic data, a flexible analysis framework, and APIs to build your own efficient population genetic algorithms. Because of its speed and scalability, tskit is well-suited to interactive analysis of large genomic datasets.

Population genetic simulation#

Several simulation tools output tree sequences. Below we use the standard library for population genetic simulation models (stdpopsim) to generate a model of Homo sapiens, in which African, Eurasian, and Asian populations combine to generate a mixed American population. We can use the demesdraw package to plot a schematic of the migrations and population size changes that define this model.

import stdpopsim
import demesdraw
from matplotlib import pyplot as plt

species = stdpopsim.get_species("HomSap")
model = species.get_demographic_model("AmericanAdmixture_4B18")

# Plot a schematic of the model
demesdraw.tubes(model.model.to_demes(), ax=plt.gca(), seed=1, log_time=True)
plt.show()

_images/4b725187527b6ef29e3710af0c5e6595c4bddb6abe1713925ab5dbabf788e2de.png

Genomic data in tree sequence format can be generated via the widely-used msprime simulator. Here we simulate 20 kilobases of genome sequence at the start of human chromosome 1 under this model, together with its evolutionary history. We generate 16 diploid genomes: 4 from each of the populations in the model. The DNA sequences and their ancestry are stored in a succinct tree sequence named ts:

contig = species.get_contig("chr1", mutation_rate=model.mutation_rate, right=20_000)
samples = {"AFR": 4, "EUR": 4, "ASIA": 4, "ADMIX": 4} # 16 diploid samples
engine = stdpopsim.get_engine("msprime")
ts = engine.simulate(model, contig, samples, seed=9).trim()  # trim to first 20kb simulated
print(f"Simulated a tree sequence of {ts.num_samples} haploid genomes:")
print(f"{ts.num_sites} variable sites over {ts.sequence_length} base pairs")

Simulated a tree sequence of 32 haploid genomes:
39 variable sites over 20000.0 base pairs

We can now inspect alleles and their frequencies at the variable sites we have simulated along the genome:

for v in ts.variants():
    display(v)
    if v.site.id >= 2: #  Only show site 0, 1, and 2, for brevity
        break

Variant
Site Id	0
Site Position	1348.0
Number of Samples	32
Number of Alleles	2
Samples with Allele 'A'	25 (78%)
Samples with Allele 'C'	7 (22%)
Has Missing Data	False
Isolated as Missing	True

Variant
Site Id	1
Site Position	2062.0
Number of Samples	32
Number of Alleles	2
Samples with Allele 'C'	31 (97%)
Samples with Allele 'A'	1 (3.1%)
Has Missing Data	False
Isolated as Missing	True

Variant
Site Id	2
Site Position	3046.0
Number of Samples	32
Number of Alleles	2
Samples with Allele 'C'	8 (25%)
Samples with Allele 'A'	24 (75%)
Has Missing Data	False
Isolated as Missing	True

Or we can display the haplotypes() (i.e. the variable sites) for each sample

samples = ts.samples()
for sample_id, h in zip(samples, ts.haplotypes(samples=samples)):
    pop = ts.node(sample_id).population
    print(f"Sample {sample_id:<2} ({ts.population(pop).metadata['name']:^5}): {h}")

Sample 0  ( AFR ): CCCACGCGCGGTACAATCTTTGTGACCGTTAGAGCTTTA
Sample 1  ( AFR ): CCCAAACGCAGTACAACGTCAGGGACCGTTAGAGCTTGG
Sample 2  ( AFR ): ACAACGCGCGTTACAATCTTTGTGACCGTTAGAGCTTTG
Sample 3  ( AFR ): CCCAAACGCAGTACAACCTCAGGGACCGTTAGAGCTTTA
Sample 4  ( AFR ): CCCAAACCTAGTAAAACCTCATGGCCCGTTAGTGCGCTT
Sample 5  ( AFR ): ACAACGCGCGGTACAATCTTTGTAACCGTTAGAGCTTTA
Sample 6  ( AFR ): CCCAAACCTAGTAAAACCTCATGGCCCGTTAGTGCGCTT
Sample 7  ( AFR ): CCCAAACGCAGTACAACCTCAGGGACCGTTAGAGCTTTA
Sample 8  ( EUR ): ACAACGCGCGGTACAATCTTTGTAACCGTTAGAGCTTTG
Sample 9  ( EUR ): ACAACGCGCGGTACAATCTTTGTAAGCGTTAGAGCTTTG
Sample 10 ( EUR ): AAAACGCGCGGTACAATCTTTGTAACCGTAAGAGCTTTG
Sample 11 ( EUR ): ACAACGCGCGGTACAATCTTTGTAACCGTTAGAGCTTTG
Sample 12 ( EUR ): ACAACGCGCGGTACAATCTTTGTAACCGTAAGAGCTTTG
Sample 13 ( EUR ): ACAACGCGCGGTACAATCTTTGTAACCGTAAGAGCTTTG
Sample 14 ( EUR ): ACAACGCGCGGTACAATCTTTGTAACCGTAAGAGCTTTG
Sample 15 ( EUR ): ACAACGCGCGGTACAATCTTTGTAACCGTAAGAGCTTTG
Sample 16 (ASIA ): ACAACGCGCGGTACAATCTTTGTAAGCGTTAGAGCTTTG
Sample 17 (ASIA ): ACAACGCGCGGTACAATCTTTGTAACCGTAAGAGCTTTG
Sample 18 (ASIA ): ACAACGCGCGGTACAATCTTTGTAACCGGTAGAGCTTTG
Sample 19 (ASIA ): ACAACGCGCGGTACAATCTTTGTAACCGTAAGAGCTTTG
Sample 20 (ASIA ): ACAACGCGCGGTACAATCTTTGTAACCGTTAAAGCTTTG
Sample 21 (ASIA ): ACAACGCGCGGTACAATCTTTGTAAGCGTTAGAGCTTTG
Sample 22 (ASIA ): ACATCGCGCGGTACAATCTTTGTAACCGTTAAAGCTTTG
Sample 23 (ASIA ): ACAACGCGCGGTACAATCTTTGTAACCGTTAGAGCTTTG
Sample 24 (ADMIX): ACAACGCGCGTTACAATCTTTGTGACCGTTAGAGCTTTG
Sample 25 (ADMIX): ACAACGCGCGGTACAATCTTTGTAACCCTAAGAGTTTTG
Sample 26 (ADMIX): ACAACGCGCGGTACAATCTTTGTGACGGTTAGAGCTTTG
Sample 27 (ADMIX): ACAACGCGCGGTCCGATCTTTGTAAGCGTTAGAGCTTTG
Sample 28 (ADMIX): ACAACGCGCGGCACAATCTTTGTAACCGGTAGAGCTTTG
Sample 29 (ADMIX): CCCAAACGCAGTACAACCTCAGGGACCGTTTGTGCGCTT
Sample 30 (ADMIX): ACAACGCGCGGTACACTCTTTGTAACCGTAAGAGCTTTG
Sample 31 (ADMIX): ACCAAATGCAGTACAACCACAGGGACCGTTAGATCTTTA

From the tree sequence it is easy to obtain the TreeSequence.allele_frequency_spectrum() for the entire region (or for windowed regions)

afs = ts.allele_frequency_spectrum()
plt.bar(range(ts.num_samples + 1), afs)
plt.title("Allele frequency spectrum")
plt.show()

_images/559a2a9f3bbb5dd37bd3ed92a92d326eb65b655a6788f3919d35401fc1702863.png

Similarly tskit allows fast and easy calculation of statistics along the genome. Here is a plot of windowed \(F_{st}\) between Africans and admixed Americans over this short region of chromosome:

# Define the samples between which Fst will be calculated
pop_id = {p.metadata["name"]: p.id for p in ts.populations()}
sample_sets=[ts.samples(pop_id["AFR"]), ts.samples(pop_id["ADMIX"])]

# Do the windowed calculation, using windows of 2 kilobases
windows = list(range(0, int(ts.sequence_length + 1), 2_000))
F_st = ts.Fst(sample_sets, windows=windows)

# Plot
plt.stairs(F_st, windows, baseline=None)
plt.ylabel("AFR-ADMIX Fst")
plt.xlabel("Genome position")
plt.show()

_images/a917ed008cdf3915e6d3b4530f65780ab3063be50d6fb129bbe87c641e700df9.png

Extracting the genetic tree at a specific genomic location is easy using tskit, which also provides methods to plot these trees. Here we grab the tree at position 10kb, and colour the different populations by different colours, as described in the viz tutorial:

tree = ts.at(10_000)

colours = dict(AFR="yellow", EUR="cyan", ASIA="green", ADMIX="red")
styles = [
    f".leaf.p{pop.id} > .sym {{fill: {colours[pop.metadata['name']]}}}"
    for pop in ts.populations()
]

styles += [ # rotate the population labels, etc
    ".leaf > .lab {text-anchor: start; transform: rotate(90deg) translate(6px)}",
    ".leaf > .sym {stroke: black}"
]

labels = { # Label samples by population
    u: ts.population(ts.node(u).population).metadata["name"].capitalize()
    for u in ts.samples()
}

tree.draw_svg(
    size=(800, 500),
    canvas_size=(800, 520),
    node_labels=labels,
    style="".join(styles),
    y_axis=True,
    y_ticks=range(0, 30_000, 10_000)
)

_images/d09e2fab53c053d487db2603fce5ba5320a41ac7de51e53a445c633b184a48a6.svg

Population genetic inference#

If, instead of simulations, you want to analyse existing genomic data (for example stored in a VCF file), you will need to infer a tree sequence from it, using e.g. tsinfer. Here we load an illustrative portion of an inferred tree sequence based on about 7500 public human genomes, including genomes from the Thousand Genomes Project and Human Genome Diversity Project. The genomic region encoded in this tree sequence has been cut down to span positions 108Mb-110Mb of human chromosome 2, which spans the EDAR gene.

Note that tree sequence files are usually imported using load(), but because this file has been additionally compressed, we load it via tszip.decompress():

import tszip
ts = tszip.decompress("data/unified_genealogy_2q_108Mb-110Mb.tsz")

# The ts encompasses a region on chr 2 with an interesting SNP (rs3827760) in the EDAR gene
edar_gene_bounds = [108_894_471, 108_989_220]  # In Mb from the start of chromosome 2
focal_variant = [v for v in ts.variants() if v.site.metadata.get("ID") == "rs3827760"].pop()
print("An interesting SNP within the EDAR gene:")
focal_variant

An interesting SNP within the EDAR gene:

Variant
Site Id	9378
Site Position	108897145.0
Number of Samples	836
Number of Alleles	2
Samples with Allele 'A'	384 (46%)
Samples with Allele 'G'	452 (54%)
Has Missing Data	False
Isolated as Missing	True

For simplicity, this tree sequence has been simplified to include only those samples from the African and East Asian regions. These belong to a number of populations. The population information, as well as information describing the variable sites, is stored in tree sequence metadata:

import pandas as pd

print(ts.num_populations, "populations defined in the tree sequence:")

pop_names_regions = [
    [p.metadata.get("name"), p.metadata.get("region")]
    for p in ts.populations()
]
display(pd.DataFrame(pop_names_regions, columns=["population name", "region"]))

67 populations defined in the tree sequence:

	population name	region
0	BantuKenya	AFRICA
1	BantuSouthAfrica	AFRICA
2	Biaka	AFRICA
3	Cambodian	EAST_ASIA
4	Dai	EAST_ASIA
...	...	...
62	Uygur	EastAsia
63	Xibo	EastAsia
64	Yi	EastAsia
65	Yoruba	Africa
66	Denisovan	None

67 rows × 2 columns

You can see that there are multiple African and East asian populations, grouped by region. Here we collect two lists of IDs for the sample nodes from the African region and from the East asian region:

sample_lists = {}
for n, rgns in {"Africa": {'AFRICA', 'Africa'}, "East asia": {'EAST_ASIA', 'EastAsia'}}.items():
    pop_ids = [p.id for p in ts.populations() if p.metadata.get("region") in rgns]
    sample_lists[n] = [u for p in pop_ids for u in ts.samples(population=p)]

With these lists we can calculate different windowed statistics (here genetic diversity and Tajima's D) within each of these regions:

edar_ts = ts.trim()  # remove regions with no data (changes the coordinate system)
windows = list(range(0, int(edar_ts.sequence_length)+1, 10_000))
data = {
    "Genetic diversity": {
        region: edar_ts.diversity(samples, windows=windows)
        for region, samples in sample_lists.items()
    },
    "Tajima's D": {
        region: edar_ts.Tajimas_D(samples, windows=windows)
        for region, samples in sample_lists.items()
    },  
}

# Plot the `data`
fig, axes = plt.subplots(ncols=2, figsize=(15, 3))
start = ts.edges_left.min()  # the empty amount at the start of the tree sequence

for (title, plot_data), ax in zip(data.items(), axes):
    ax.set_title(title)
    ax.axvspan(edar_gene_bounds[0], edar_gene_bounds[1], color="lightgray")
    ax.axvline(focal_variant.site.position, ls=":")
    for label, stat in plot_data.items():
        ax.stairs(stat, windows+start, baseline=None, label=label)
    ax.text(edar_gene_bounds[0], 0, "EDAR")
    ax.legend()
plt.show()

_images/6f34aabb7cfa0d4a12d33c1dbf651063d3953e3b56db7aaaf9e64f964ae182af.png

Other population genetic libraries such as scikit-allel (which is interoperable with tskit) could also have been used to produce the plot above. In this case, the advantage of using tree sequences is simply that they allow these sorts of analysis to scale to datasets of millions of whole genomes.

Topological analysis#

As this inferred tree sequence stores (an estimate of) the underlying genealogy, we can also derive statistics based on genealogical relationships. For example, this tree sequence also contains a sample genome based on an ancient genome, a Denisovan individual. We can look at the closeness of relationship between samples from the different geographical regions and the Denisovan:

Todo

Show an example of looking at topological relationships between the Denisovan and various East Asian groups, using the Counting topologies functionality.

See Counting topologies for an introduction to topological methods in tskit.

Further information#

This brief introduction is meant as a simple taster. Many other efficient population genetic analyses are possible when you have genomic data stored as a tree sequence.

The rest of the tutorials contain a large number of examples which are relevant to population genetic analysis and research. You can also visit the learning section of the tskit website.