{
"cells": [
{
"cell_type": "markdown",
"id": "32a5906f",
"metadata": {},
"source": [
"```{currentmodule} tskit\n",
"```\n",
"\n",
"\n",
"(sec_stats)=\n",
"\n",
"# Statistics\n",
"\n",
"The `tskit` library provides a large number of functions for calculating population\n",
"genetic statistics from tree sequences. Statistics can reflect either the distribution\n",
"of genetic variation or the underlying trees that generate it; the\n",
"[duality](https://doi.org/10.1534/genetics.120.303253) between\n",
"mutations and branch lengths on trees means that statistics based on genetic variation\n",
"often have an corresponding version based on branch lengths in trees in a tree sequence.\n",
"\n",
"Note that `tskit` provides a unified interface for computing so-called\n",
"\"single site statistics\" that are summaries across sites in windows of the genome, as\n",
"well as a standard method for specifying \"multi-way\" statistics that are calculated\n",
"over many combinations of sets of samples simultaneously.\n",
"\n",
"Please see the {ref}`tutorial ` for examples of the\n",
"statistics API in use.\n",
"\n",
":::{warning}\n",
"{ref}`sec_data_model_missing_data` is not currently\n",
"handled correctly by site statistics defined here, as we always\n",
"assign missing data to be equal to the ancestral state. Later\n",
"versions will add this behaviour as an option and will account\n",
"for the presence of missing data by default.\n",
":::\n",
"\n",
"(sec_stats_available)=\n",
"\n",
"## Available statistics\n",
"\n",
"Here are the statistics that can be computed using `tskit`,\n",
"grouped by basic classification and type. Single-site statistics are ones that are\n",
"averages across sites in windows of the genome, returning numpy arrays whose\n",
"dimensions are determined by the parameters (see {ref}`sec_stats_output_dimensions`).\n",
"\n",
"{ref}`sec_stats_sample_sets_one_way` are defined over a single sample set, \n",
"whereas {ref}`sec_stats_sample_sets_multi_way` compare 2 or more sets of samples.\n",
"\n",
"Some of the methods below benefit from a little extra discussion, provided in the\n",
"{ref}`sec_stats_notes` section at the end of this chapter: if so, a link to the note\n",
"appears beside the listed method.\n",
"\n",
"* Single site\n",
" * One-way\n",
" * {meth}`~TreeSequence.allele_frequency_spectrum` (see {ref}`notes`)\n",
" * {meth}`~TreeSequence.diversity`\n",
" * {meth}`~TreeSequence.segregating_sites`\n",
" * {meth}`~TreeSequence.trait_covariance`\n",
" {meth}`~TreeSequence.trait_correlation`\n",
" {meth}`~TreeSequence.trait_linear_model`\n",
" (see {ref}`sec_stats_notes_trait`)\n",
" * {meth}`~TreeSequence.Tajimas_D` (see {ref}`notes`)\n",
" * Multi-way\n",
" * {meth}`~TreeSequence.divergence`\n",
" * {meth}`~TreeSequence.genetic_relatedness`\n",
" * {meth}`~TreeSequence.f4`\n",
" {meth}`~TreeSequence.f3`\n",
" {meth}`~TreeSequence.f2`\n",
" (see {ref}`sec_stats_notes_f`)\n",
" * {meth}`~TreeSequence.Y3`\n",
" {meth}`~TreeSequence.Y2`\n",
" (see {ref}`sec_stats_notes_y`)\n",
" * {meth}`~TreeSequence.genealogical_nearest_neighbours` (see {ref}`sec_stats_notes_gnn`)\n",
" * {meth}`~TreeSequence.Fst` (see {ref}`sec_stats_notes_derived`)\n",
"* Multi site\n",
" * {meth}`~LdCalculator` (note this is soon to be deprecated)\n",
"\n",
":::{note}\n",
"There is a general framework provided for calculating additional single site\n",
"statistics (see the {ref}`sec_stats_general_api` section). However, the\n",
"pre-implemented statistics in the table above will be faster than rederiving\n",
"them using the general framework directly, so the versions above should be preferred.\n",
":::\n",
"\n",
"\n",
"(sec_stats_examples)=\n",
"\n",
"### Quick examples"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "3b882bfb",
"metadata": {
"tags": [
"hide-input"
]
},
"outputs": [
{
"data": {
"text/markdown": [
"These examples use a tree sequence of 8 samples in 2 populations, with a sequence length of 1000. There are 7 trees and 6 variable sites in the tree sequence."
],
"text/plain": [
""
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from IPython.display import Markdown\n",
"import msprime\n",
"import numpy as np\n",
"\n",
"demography = msprime.Demography()\n",
"demography.add_population(name=\"A\", initial_size=10_000)\n",
"demography.add_population(name=\"B\", initial_size=10_000)\n",
"demography.set_symmetric_migration_rate([\"A\", \"B\"], 0.001)\n",
"ts = msprime.sim_ancestry(\n",
" samples={\"A\": 2, \"B\": 2},\n",
" sequence_length=1000,\n",
" demography=demography,\n",
" recombination_rate=2e-8,\n",
" random_seed=12)\n",
"ts = msprime.sim_mutations(ts, rate=2e-8, random_seed=12)\n",
"Markdown(\n",
" f\"These examples use a tree sequence of {ts.num_samples} samples \"\n",
" f\"in {ts.num_populations} populations, \"\n",
" f\"with a sequence length of {int(ts.sequence_length)}. \"\n",
" f\"There are {ts.num_trees} trees and \"\n",
" f\"{ts.num_sites} variable sites in the tree sequence.\"\n",
")"
]
},
{
"cell_type": "markdown",
"id": "a6e0a657",
"metadata": {},
"source": [
"#### Basic calling convention"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "1a5b6f6c",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0.001964285714285714\n"
]
}
],
"source": [
"pi = ts.diversity()\n",
"print(pi) # Genetic diversity within the sample set"
]
},
{
"cell_type": "markdown",
"id": "b99c8436",
"metadata": {},
"source": [
"#### Restrict to {ref}`sample sets`"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "25c1f189",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0.0013333333333333333\n"
]
}
],
"source": [
"pi_0 = ts.diversity(sample_sets=ts.samples(population=0))\n",
"print(pi_0) # Genetic diversity within population 0"
]
},
{
"cell_type": "markdown",
"id": "3c0a59bb",
"metadata": {},
"source": [
"#### Summarise in genomic {ref}`windows`"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "2eea8eb3",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[0.00125 0.0025 0.00375]\n"
]
}
],
"source": [
"pi_window = ts.diversity(sample_sets=ts.samples(population=1), windows=[0, 400, 600, 1000])\n",
"print(pi_window) # Genetic diversity within population 1 in three windows along the genome"
]
},
{
"cell_type": "markdown",
"id": "67f679a7",
"metadata": {},
"source": [
"#### Compare {ref}`between` sample sets"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "ffe47de1",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0.002\n"
]
}
],
"source": [
"dxy = ts.divergence(sample_sets=[ts.samples(population=0), ts.samples(population=1)])\n",
"print(dxy) # Av number of differences per bp between samples in population 0 and 1"
]
},
{
"cell_type": "markdown",
"id": "061d5430",
"metadata": {},
"source": [
"#### Change the {ref}`mode`"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "ae49eded",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"75549.0842356655\n"
]
}
],
"source": [
"bl = ts.divergence(\n",
" mode=\"branch\", # Use branch lengths rather than genetic differences\n",
" sample_sets=[ts.samples(population=0), ts.samples(population=1)],\n",
")\n",
"print(bl) # Av branch length separating samples in population 0 and 1"
]
},
{
"cell_type": "markdown",
"id": "bc9fd215",
"metadata": {},
"source": [
"(sec_stats_single_site)=\n",
"\n",
"## Single site statistics\n",
"\n",
"\n",
"(sec_stats_interface)=\n",
"\n",
"### Interface\n",
"\n",
"Tskit offers a powerful and flexible interface for computing population genetic\n",
"statistics. Consequently, the interface is a little complicated and there are a\n",
"lot of options. However, we provide sensible defaults for these options and\n",
"`tskit` should do what you want in most cases. There are several major options\n",
"shared by many statistics, which we describe in detail in the following subsections:\n",
"\n",
"{ref}`sec_stats_mode`\n",
": What are we summarising information about?\n",
"\n",
"{ref}`sec_stats_windows`\n",
": What section(s) of the genome are we interested in?\n",
"\n",
"{ref}`sec_stats_span_normalise`\n",
": Should the statistic calculated for each window be normalised by the span\n",
" (i.e. the sequence length) of that window?\n",
"\n",
"The statistics functions are highly efficient and are based where possible\n",
"on numpy arrays. Each of these statistics will return the results as a numpy\n",
"array, and the format of this array will depend on the statistic being\n",
"computed (see the {ref}`sec_stats_output_format` section for details).\n",
"A convenient feature of the statistics API is that the dimensions of the\n",
"output array is defined in a simple and intuitive manner by the\n",
"parameters provided. The {ref}`sec_stats_output_dimensions` section\n",
"defines the rules that are used.\n",
"\n",
"Please see the {ref}`tutorial ` for examples of the\n",
"statistics API in use.\n",
"\n",
"\n",
"(sec_stats_mode)=\n",
"\n",
"#### Mode\n",
"\n",
"There are three **modes** of statistic: `site`, `branch`, and `node`,\n",
"that each summarize aspects of the tree sequence in different but related ways.\n",
"Roughly speaking, these answer the following sorts of question:\n",
"\n",
"site\n",
": How many mutations differentiate these two genomes?\n",
"\n",
"branch\n",
": How long since these genomes' common ancestor?\n",
"\n",
"node\n",
": On how much of the genome is each node an ancestor of only one of these genomes, but not both?\n",
"\n",
"These three examples can all be answered in the same way with the tree sequence:\n",
"first, draw all the paths from one genome to the other through the tree sequence\n",
"(back up to their common ancestor and back down in each marginal tree).\n",
"Then,\n",
"(`site`) count the number of mutations falling on the paths,\n",
"(`branch`) measure the length of the paths, or\n",
"(`node`) count how often the path goes through each node.\n",
"There is more discussion of this correspondence in the paper describing these statistics,\n",
"and precise definitions are given in each statistic.\n",
"\n",
"Here's an example of using the {meth}`~TreeSequence.diversity` statistic to return the\n",
"average branch length between all pairs of samples:"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "6b53dd2e",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"74860.28301677053"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ts.diversity(mode=\"branch\")"
]
},
{
"cell_type": "markdown",
"id": "c324b96a",
"metadata": {},
"source": [
"One important thing to know is that `node` statistics have somewhat different output.\n",
"While `site` and `branch` statistics naturally return one number\n",
"for each portion of the genome (and thus incorporates information about many nodes: see below),\n",
"the `node` statistics return one number **for each node** in the tree sequence (and for each window).\n",
"There can be a lot of nodes in the tree sequence, so beware.\n",
"\n",
"Also remember that in a tree sequence the \"sites\" are usually just the **variant** sites,\n",
"e.g., the sites of the SNPs. Although the tree sequence may in principle have monomorphic\n",
"sites, those produced by simulation usually don't.\n",
"\n",
"\n",
"(sec_stats_sample_sets)=\n",
"\n",
"#### Sample sets and indexes\n",
"\n",
"Many standard population genetics statistics\n",
"are defined with respect to some number of groups of genomes,\n",
"usually called \"populations\".\n",
"(However, it's clear from the correspondence to descriptors of tree shape\n",
"that the definitions can usefully describe *something*\n",
"even if the groups of samples don't come from \"separate populations\" in some sense.)\n",
"Basically, statistics defined in terms of sample sets can use the frequency of any allele\n",
"in each of the sample sets when computing the statistic.\n",
"For instance, nucleotide divergence is defined for a *pair* of groups of samples,\n",
"so if you wanted to compute pairwise divergences between some groups of samples,\n",
"you'd specify these as your `sample_sets`.\n",
"Then, if `p[i]` is the derived allele frequency in sample set `i`,\n",
"under the hood we (essentially) compute the divergence between sample sets `i` and `j`\n",
"by averaging `p[i] * (1 - p[j]) + (1 - p[i]) * p[j]` across the genome.\n",
"\n",
"Concretely, `sample_sets` specifies the IDs of the nodes to compute statistics of.\n",
"Importantly, these nodes must be {ref}`samples `.\n",
"\n",
"Here's an example of calculating the average\n",
"{meth}`genetic diversity` within a specific population:"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "f1e67c30",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.0013333333333333333"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ts.diversity(sample_sets=ts.samples(population=0))"
]
},
{
"cell_type": "markdown",
"id": "b7c43509",
"metadata": {},
"source": [
"So, what if you\n",
"have samples from each of 10 populations,\n",
"and want to compute **all** fourty-five pairwise divergences?\n",
"You could call `divergence` fourty-five times, but this would be tedious\n",
"and also inefficient, because the allele frequencies for one population\n",
"gets used in computing many of those values.\n",
"So, statistics that take a `sample_sets` argument also take an `indexes` argument,\n",
"which for a statistic that operates on `k` sample sets will be a list of `k`-tuples.\n",
"If `indexes` is a length `n` list of `k`-tuples,\n",
"then the output will have `n` columns,\n",
"and if `indexes[j]` is a tuple `(i0, ..., ik)`,\n",
"then the `j`-th column will contain values of the statistic computed on\n",
"`(sample_sets[i0], sample_sets[i1], ..., sample_sets[ik])`.\n",
"\n",
"\n",
"How multiple statistics are handled differs slightly between statistics\n",
"that operate on single sample sets and multiple sample sets.\n",
"\n",
"\n",
"(sec_stats_sample_sets_one_way)=\n",
"\n",
"##### One-way methods\n",
"\n",
"One-way statistics such as {meth}`TreeSequence.diversity` are defined over a single\n",
"sample set. For these methods, `sample_sets` is interpreted in the following way:\n",
"\n",
"- If it is a single list of node IDs (e.g., `sample_sets=[0, 1 ,2]`), this is\n",
" interpreted as running the calculation over one sample set and we remove\n",
" the last dimension in the result array as described in the\n",
" {ref}`sec_stats_output_dimensions` section.\n",
"\n",
"- If it is `None` (the default), this is equivalent to `sample_sets=ts.samples()`,\n",
" and we therefore compute the statistic over all samples in the tree sequence. **Note\n",
" that we also drop the outer dimension of the result array in this case**.\n",
"\n",
"- If it is a list of lists of samples we return an array for each window in the output,\n",
" which contains the value of the statistic separately for each of `sample_sets`\n",
" in the order they are given.\n",
"\n",
"\n",
"(sec_stats_sample_sets_multi_way)=\n",
"\n",
"##### Multi-way methods\n",
"\n",
"Multi-way statistics such as {meth}`TreeSequence.divergence` are defined over a\n",
"`k` sample sets. In this case, `sample_sets` must be a list of lists of sample IDs,\n",
"and there is no default. For example, this finds the average\n",
"{meth}`genetic divergence` between samples in populations\n",
"0 and 1"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "4459ccc2",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.002"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ts.divergence(\n",
" sample_sets=[\n",
" ts.samples(population=0),\n",
" ts.samples(population=1),\n",
" ]\n",
")"
]
},
{
"cell_type": "markdown",
"id": "14288a5e",
"metadata": {},
"source": [
"The `indexes` parameter is interpreted in the following way:\n",
"\n",
"- If it is a single `k`-tuple, this is interpreted as computing a single\n",
" statistic selecting the specified sample sets and we remove the last dimension\n",
" in the result array as described in the {ref}`sec_stats_output_dimensions` section.\n",
"\n",
"- If if is `None` and `sample_sets` contains exactly `k` sample sets,\n",
" this is equivalent to `indexes=range(k)`. **Note\n",
" that we also drop the outer dimension of the result array in this case**.\n",
"\n",
"- If is is a list of `k`-tuples (each consisting of integers\n",
" of integers between `0` and `len(sample_sets) - 1`) of length `n` we\n",
" compute `n` statistics based on these selections of sample sets.\n",
"\n",
"\n",
"(sec_stats_windows)=\n",
"\n",
"#### Windows\n",
"\n",
"Each statistic has an argument, `windows`,\n",
"which defines a collection of contiguous windows spanning the genome.\n",
"`windows` should be a list of `n+1` increasing numbers beginning with 0\n",
"and ending with the `sequence_length`.\n",
"The statistic will be computed separately in each of the `n` windows,\n",
"and the `k`-th row of the output will report the values of the statistic\n",
"in the `k`-th window, i.e., from (and including) `windows[k]` to\n",
"(but not including) `windows[k+1]`. For example, this calculates\n",
"{meth}`Tajima's D` in four evenly spaced windows along the\n",
"genome:"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "94b3a761",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([ nan, -1.31009224, 1.16649684, -0.81245539])"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"num_windows = 4\n",
"ts.Tajimas_D(windows=np.linspace(0, ts.sequence_length, num_windows + 1))"
]
},
{
"cell_type": "markdown",
"id": "4721e923",
"metadata": {},
"source": [
"Most windowed statistics by default return **averages** within each of the windows,\n",
"so the values are comparable between windows, even of different spans.\n",
"(However, shorter windows may be noisier.)\n",
"Suppose for instance that you compute some statistic with `windows = [0, a, b]`\n",
"for some valid positions `0 < a < b`,\n",
"and get an output array `S` with two rows.\n",
"Then, computing the same statistic with `windows = [0, b]`\n",
"would be equivalent to averaging the rows of `S`,\n",
"obtaining `((a - 0) * S[0] + (b - a) * S[1]) / (b - 0)`.\n",
"\n",
"There are some shortcuts to other useful options:\n",
"\n",
"`windows = None`\n",
" This is the default and computes statistics in single window over the whole\n",
" sequence. As the first returned array contains only a single\n",
" value, we drop this dimension as described in the\n",
" {ref}`output dimensions ` section. **Note:** if you\n",
" really do want to have an array with a single value as the result, please use\n",
" `windows = [0.0, ts.sequence_length]`.\n",
"\n",
"`windows = \"trees\"`\n",
" This says that you want statistics computed separately on the portion of the genome\n",
" spanned by each tree, so is equivalent to passing `windows = ts.breakpoints()`.\n",
" (Beware: there can be a lot of trees!)\n",
"\n",
"`windows = \"sites\"`\n",
" This says to output one set of values for **each site**,\n",
" and is equivalent to passing `windows = [s.position for s in ts.sites()] + [ts.sequence_length]`.\n",
" This will return one statistic for each site (beware!);\n",
" since the windows are all different sizes you probably want to also pass\n",
" `span_normalise=False` (see below).\n",
"\n",
"\n",
"(sec_stats_span_normalise)=\n",
"\n",
"#### Span normalise\n",
"\n",
"In addition to windowing there is an option, `span_normalise` (which defaults to `True`),\n",
"All the primary statistics defined here are *sums* across locations in the genome:\n",
"something is computed for each position, and these values are added up across all positions in each window.\n",
"Whether the total span of the window is then taken into account is determined by the option `span_normalise`:\n",
"if it is `True` (the default), the sum for each window is converted into an *average*,\n",
"by dividing by the window's *span* (i.e. the length of genome that it covers).\n",
"Otherwise, the sum itself is returned.\n",
"The default is `span_normalise=True`,\n",
"because this makes the values comparable across windows of different sizes.\n",
"To make this more concrete: {meth}`pairwise sequence divergence <.TreeSequence.divergence>`\n",
"between two samples with `mode=\"site\"` is the density of sites that differ between the samples;\n",
"this is computed for each window by counting up the number of sites\n",
"at which the two differ, and dividing by the total span of the window.\n",
"If we wanted the number of sites at which the two differed in each window,\n",
"we'd calculate divergence with `span_normalise=False`.\n",
"\n",
"Following on from above, suppose we computed the statistic `S` with\n",
"`windows = [0, a, b]` and `span_normalise=True`,\n",
"and then computed `T` in just the same way except with `span_normalise=False`.\n",
"Then `S[0]` would be equal to `T[0] / a` and `S[1] = T[1] / (b - a)`.\n",
"Furthermore, the value obtained with `windows = [0, b]` would be equal to `T[0] + T[1]`.\n",
"However, you probably usually want the (default) normalized version:\n",
"don't get unnormalised values unless you're sure that's what you want.\n",
"The exception is when computing a site statistic with `windows = \"sites\"`:\n",
"this case, computes a statistic with the pattern of genotypes at each site,\n",
"and normalising would divide these statistics by the distance to the previous variant site\n",
"(probably not what you want to do).\n",
"\n",
":::{note}\n",
"The resulting values are scaled \"per unit of sequence length\" - for instance, pairwise\n",
"sequence divergence is measured in \"differences per unit of sequence length\". Functions\n",
"such as {func}`msprime:msprime.sim_mutations` will by default add mutations in discrete\n",
"coordinates, usually interpreted as base pairs, in which\n",
"case span normalised statistics are in units of \"per base pair\".\n",
":::\n",
"\n",
"(sec_stats_output_format)=\n",
"\n",
"#### Output format\n",
"\n",
"Each of the statistics methods returns a `numpy` ndarray.\n",
"Suppose that the output is named `out`.\n",
"If `windows` has been specified, the number of rows of the output is equal to the\n",
"number of windows, so that `out.shape[0]` is equal to `len(windows) - 1`\n",
"and `out[i]` is an array of statistics describing the portion of the tree sequence\n",
"from `windows[i]` to `windows[i + 1]` (including the left but not the right endpoint).\n",
"What is returned within each window depends on the {ref}`mode `:\n",
"\n",
"`mode=\"site\"` or `mode=\"branch\"`\n",
" The output is a two-dimensional array,\n",
" with columns corresponding to the different statistics computed: `out[i, j]` is the `j`-th statistic\n",
" in the `i`-th window.\n",
"\n",
"`mode=\"node\"`\n",
" The output is a three-dimensional array,\n",
" with the second dimension corresponding to node id.\n",
" In other words, `out.shape[1]` is equal to `ts.num_nodes`,\n",
" and `out[i,j]` is an array of statistics computed for node `j` on the `i`-th window.\n",
"\n",
"The final dimension of the arrays in other cases is specified by the method.\n",
"\n",
"Note, however, that empty dimensions can optionally be dropped,\n",
"as described in the {ref}`sec_stats_output_dimensions` section.\n",
"\n",
"A note about **default values** and **division by zero**:\n",
"Under the hood, statistics computation fills in zeros everywhere, then updates these\n",
"(since statistics are all **additive**, this makes sense).\n",
"But now suppose that you've got a statistic that returns `nan`\n",
"(\"not a number\") sometimes, like if you're taking the diversity of a sample set with only `n=1` sample,\n",
"which involves dividing by `n * (n - 1)`.\n",
"Usually, you'll just get `nan` everywhere that the division by zero happens.\n",
"But there's a couple of caveats.\n",
"For `site` statistics, any windows without any sites in them never get touched,\n",
"so they will have a value of 0.\n",
"For `branch` statistics, any windows with **no branches** will similarly remain 0.\n",
"That said, you should **not** rely on the specific behavior of whether `0` or `nan` is returned\n",
"for \"empty\" cases like these: it is subject to change.\n",
"\n",
"\n",
"(sec_stats_output_dimensions)=\n",
"\n",
"#### Output dimensions\n",
"\n",
"In the general case, tskit outputs two dimensional (or three dimensional, in the case of node\n",
"stats) numpy arrays, as described in the {ref}`sec_stats_output_format` section.\n",
"The first dimension corresponds to the window along the genome\n",
"such that for some result array `x`, `x[j]` contains information about the jth window.\n",
"The last dimension corresponds to the statistics being computed, so that `x[j, k]` is the\n",
"value of the kth statistic in the jth window (in the two dimensional case). This is\n",
"a powerful and general interface, but in many cases we will not use this full generality\n",
"and the extra dimensions in the numpy arrays are inconvenient.\n",
"\n",
"Tskit optionally removes empty dimensions from the output arrays following a few\n",
"simple rules.\n",
"\n",
"1. If `windows` is None we are computing over the single window covering the\n",
" full sequence. We therefore drop the first dimension of the array.\n",
"\n",
"2. In one-way stats, if the `sample_sets` argument is a 1D array we interpret\n",
" this as specifying a single sample set (and therefore a single statistic), and\n",
" drop the last dimension of the output array. If `sample_sets` is None\n",
" (the default), we use the sample set `ts.samples()`, invoking\n",
" this rule (we therefore drop the last dimension by default).\n",
"\n",
"3. In k-way stats, if the `indexes` argument is a 1D array of length k\n",
" we intepret this as specifying a single statistic and drop the last\n",
" dimension of the array. If `indexes` is None (the default) and\n",
" there are k sample sets, we compute the statistic from these sample sets\n",
" and drop the last dimension.\n",
"\n",
"4. If, after dropping these dimensions, the dimension is 0, we return a numpy\n",
" scalar (instead of an array of dimension 0).\n",
"\n",
"Rules 2 and 3 can be summarised by \"the dimensions of the input determines\n",
"the dimensions of the output\". Note that dropping these dimensions is\n",
"**optional**: it is always possible to keep the full dimensions of the\n",
"output arrays.\n",
"\n",
"Please see the {ref}`tutorial ` for examples of the\n",
"various output dimension options.\n",
"\n",
"\n",
"(sec_stats_general_api)=\n",
"\n",
"### General API\n",
"\n",
"The methods {meth}`TreeSequence.general_stat` and {meth}`TreeSequence.sample_count_stat`\n",
"provide access to the general-purpose algorithm for computing statistics.\n",
"Here is a bit more discussion of how to use these.\n",
"\n",
"\n",
"(sec_stats_polarisation)=\n",
"\n",
"#### Polarisation\n",
"\n",
"Many statistics calculated from genome sequence treat all alleles on equal footing,\n",
"as one must without knowledge of the ancestral state and sequence of mutations that produced the data.\n",
"Separating out the *ancestral* allele (e.g., as inferred using an outgroup)\n",
"is known as *polarisiation*.\n",
"For instance, in the allele frequency spectrum, a site with alleles at 20% and 80% frequency\n",
"is no different than another whose alleles are at 80% and 20%,\n",
"unless we know in each case which allele is ancestral,\n",
"and so while the unpolarised allele frequency spectrum gives the distribution of frequencies of *all* alleles,\n",
"the *polarised* allele frequency spectrum gives the distribution of frequencies of only *derived* alleles.\n",
"\n",
"This concept is extended to more general statistics as follows.\n",
"For site statistics, summary functions are applied to the total weight or number of samples\n",
"associated with each allele; but if polarised, then the ancestral allele is left out of this sum.\n",
"For branch or node statistics, summary functions are applied to the total weight or number of samples\n",
"below, and above each branch or node; if polarised, then only the weight below is used.\n",
"\n",
"\n",
"(sec_stats_summary_functions)=\n",
"\n",
"#### Summary functions\n",
"\n",
"For convenience, here are the summary functions used for many of the statistics.\n",
"Below, {math}`x` denotes the number of samples in a sample set below a node,\n",
"`n` denotes the total size of a sample set, {math}`p = x / n`,\n",
"and boolean expressions (e.g., {math}`(x > 0)`) are interpreted as 0/1.\n",
"\n",
"`diversity`\n",
": {math}`f(x) = \\frac{x (n - x)}{n (n-1)}`\n",
"\n",
" For an unpolarized statistic with biallelic loci, this calculates\n",
" {math}`2 p (1-p)`.\n",
"\n",
"`segregating_sites`\n",
": {math}`f(x) = (x > 0) (1 - x / n)`\n",
"\n",
" (Note: this works because if {math}`\\sum_i p_1 = 1` then {math}`\\sum_{i=1}^k (1-p_i) = k-1`.)\n",
"\n",
"`Y1`\n",
": {math}`f(x) = \\frac{x (n - x) (n - x - 1)}{n (n-1) (n-2)}`\n",
"\n",
"`divergence`\n",
": {math}`f(x_1, x_2) = \\frac{x_1 (n_2 - x_2)}{n_1 n_2}`,\n",
"\n",
" unless the two indices are the same, when the diversity function is used.\n",
"\n",
" For an unpolarized statistic with biallelic loci, this calculates\n",
" {math}`p_1 (1-p_2) + (1 - p_1) p_2`.\n",
"\n",
"`genetic_relatedness`\n",
": {math}`f(x_i, x_j) = \\frac{1}{2}(x_i - m)(x_j - m)`,\n",
"\n",
" where {math}`m = \\frac{1}{n}\\sum_{k=1}^n x_k` with {math}`n` the total number\n",
" of samples.\n",
"\n",
"`Y2`\n",
": {math}`f(x_1, x_2) = \\frac{x_1 (n_2 - x_2) (n_2 - x_2 - 1)}{n_1 n_2 (n_2 - 1)}`\n",
"\n",
"`f2`\n",
": {math}`f(x_1, x_2) = \\frac{x_1 (x_1 - 1) (n_2 - x_2) (n_2 - x_2 - 1)}{n_1 (n_1 - 1) n_2 (n_2 - 1)} - \\frac{x_1 (n_1 - x_1) (n_2 - x_2) x_2}{n_1 (n_1 - 1) n_2 (n_2 - 1)}`\n",
"\n",
" For an unpolarized statistic with biallelic loci, this calculates\n",
" {math}`((p_1 - p_2)^2 - (p_1 (1-p_2)^2 + (1-p_1) p_2^2)/n_1 - (p_1^2 (1-p_2) + (1-p_1)^2 p_2)/n_2`\n",
" {math}`+ (p_1 p_2 + (1-p_1)(1-p_2))/ n_1 n_2)(1 + \\frac{1}{n_1 - 1})(1 + \\frac{1}{n_2 - 1})`,\n",
" which is the unbiased estimator for {math}`(p_1 - p_2)^2` from a finite sample.\n",
"\n",
"`Y3`\n",
": {math}`f(x_1, x_2, x_3) = \\frac{x_1 (n_2 - x_2) (n_3 - x_3)}{n_1 n_2 n_3}`\n",
"\n",
"`f3`\n",
": {math}`f(x_1, x_2, x_3) = \\frac{x_1 (x_1 - 1) (n_2 - x_2) (n_3 - x_3)}{n_1 (n_1 - 1) n_2 n_3} - \\frac{x_1 (n_1 - x_1) (n_2 - x_2) x_3}{n_1 (n_1 - 1) n_2 n_3}`\n",
"\n",
" For an unpolarized statistic with biallelic loci, this calculates\n",
" {math}`((p_1 - p_2)(p_1 - p_3) - p_1 (1-p_2)(1-p_3)/n_1 - (1-p_1) p_2 p_3/n_1)(1 + \\frac{1}{n_1 - 1})`,\n",
" which is the unbiased estimator for {math}`(p_1 - p_2)(p_1 - p_3)` from a finite sample.\n",
"\n",
"`f4`\n",
": {math}`f(x_1, x_2, x_3, x_4) = \\frac{x_1 x_3 (n_2 - x_2) (n_4 - x_4)}{n_1 n_2 n_3 n_4} - \\frac{x_1 x_4 (n_2 - x_2) (n_3 - x_3)}{n_1 n_2 n_3 n_4}`\n",
"\n",
" For an unpolarized statistic with biallelic loci, this calculates\n",
" {math}`(p_1 - p_2)(p_3 - p_4)`.\n",
"\n",
"`trait_covariance`\n",
": {math}`f(w) = \\frac{w^2}{2 (n-1)^2}`,\n",
"\n",
" where {math}`w` is the sum of all trait values of the samples below the node.\n",
"\n",
"`trait_correlation`\n",
": {math}`f(w, x) = \\frac{w^2}{2 x (1 - x/n) (n - 1)}`,\n",
"\n",
" where as before {math}`x` is the total number of samples below the node,\n",
" and {math}`n` is the total number of samples.\n",
"\n",
"`trait_linear_model`\n",
": {math}`f(w, z, x) = \\frac{1}{2}\\left( \\frac{w - \\sum_{j=1}^k z_j v_j}{x - \\sum_{j=1}^k z_j^2} \\right)^2`,\n",
"\n",
" where {math}`w` and {math}`x` are as before,\n",
" {math}`z_j` is the sum of the j-th normalised covariate values below the node,\n",
" and {math}`v_j` is the covariance of the trait with the j-th covariate.\n",
"\n",
"\n",
"## Multi site statistics\n",
"\n",
":::{todo}\n",
"Document statistics that use information about correlation between sites, such as\n",
"LdCalculator (and perhaps reference {ref}`sec_identity`). Note that if we have a general\n",
"framework which has the same calling conventions as the single site stats,\n",
"we can rework the sections above.\n",
":::\n",
"\n",
"\n",
"(sec_stats_notes)=\n",
"\n",
"## Notes\n",
"\n",
"\n",
"(sec_stats_notes_afs)=\n",
"\n",
"### Allele frequency spectrum\n",
"\n",
"Most single site statistics are based on the summaries of the allele frequency spectra\n",
"(AFS). The `tskit` AFS interface includes windowed and joint spectra,\n",
"using the same general pattern as other statistics,\n",
"but some of the details about how it is defined,\n",
"especially in the presence of multiple alleles per site, need to be explained.\n",
"If all sites are biallelic, then the result is just as you'd expect:\n",
"see the method documentation at {meth}`~TreeSequence.allele_frequency_spectrum` \n",
"for the description.\n",
"Note that with `mode=\"site\"`, we really do tabulate *allele* counts:\n",
"if more than one mutation on different parts of the tree produce the same allele,\n",
"it is the total number with this allele (i.e., inheriting *either* mutation)\n",
"that goes into the AFS.\n",
"The AFS with `mode=\"branch\"` is the expected value for the Site AFS\n",
"with infinite-sites, biallelic mutation, so there is nothing surprising there,\n",
"either.\n",
"\n",
"But, how do we deal with sites at which there are more than two alleles?\n",
"At each site, we iterate over the distinct alleles at that site,\n",
"and for each allele, count how many samples in each sample set\n",
"have inherited that allele.\n",
"For a concrete example, suppose that we are computing the AFS of a single\n",
"sample set with 10 samples, and are considering a site with three alleles:\n",
"*a*, *b*, and *c*,\n",
"which have been inherited by 6, 3, and 1 samples, respectively,\n",
"and that allele *a* is ancestral.\n",
"What we do at this site depends on if the AFS is polarised or not.\n",
"\n",
"If we are computing the *polarised* AFS,\n",
"we add 1 to each entry of the output corresponding to each allele count\n",
"*except* the ancestral allele.\n",
"In our example, we'd add 1 to both `AFS[3]` and `AFS[1]`.\n",
"This means that the sum of all entries of a polarised, site AFS\n",
"should equal the total number of non-ancestral alleles in the tree sequence\n",
"that are ancestral to at least one of the samples in the tree sequence\n",
"but not ancestral to all of them.\n",
"The reason for this last caveat is that like with most statistics,\n",
"mutations that are not ancestral to *any* samples (not just those in the sample sets)\n",
"are not counted (and so don't even enter into `AFS[0]`),\n",
"and similarly for those alleles inherited by *all* samples.\n",
"\n",
"Now, if we are computing the *unpolarised* AFS,\n",
"we add *one half* to each entry of the *folded* output\n",
"corresponding to each allele count *including* the ancestral allele.\n",
"What does this mean?\n",
"Well, `polarised=False` means that we cannot distinguish between an\n",
"allele count of 6 and an allele count of 4.\n",
"So, *folding* means that we would add our allele that is seen in 6 samples\n",
"to `AFS[4]` instead of `AFS[6]`.\n",
"So, in total, we will add 0.5 to each of `AFS[4]`, `AFS[3]`, and `AFS[1]`.\n",
"This means that the sum of an unpolarised AFS\n",
"will be equal to the total number of alleles that are inherited\n",
"by any of the samples in the tree sequence, divided by two.\n",
"Why one-half? Well, notice that if in fact the mutation that produced the *b*\n",
"allele had instead produced an *a* allele,\n",
"so that the site had only two alleles, with frequencies 7 and 3.\n",
"Then, we would have added 0.5 to `AFS[3]` *twice*.\n",
"\n",
"\n",
"(sec_stats_notes_trait)=\n",
"\n",
"### Trait correlations\n",
"\n",
"{meth}`~TreeSequence.trait_covariance`, {meth}`~TreeSequence.trait_correlation`, and\n",
"{meth}`~TreeSequence.trait_linear_model` compute correlations and covariances of traits\n",
"(i.e., an arbitrary vector) with allelic state, possibly in the context of a multivariate\n",
"linear model with other covariates (as in GWAS).\n",
"\n",
"\n",
"(sec_stats_notes_f)=\n",
"\n",
"### Patterson's f statistics\n",
"\n",
"{meth}`~TreeSequence.f4`, {meth}`~TreeSequence.f3`, and {meth}`~TreeSequence.f2`\n",
"are the `f` statistics (also called `F` statistics) introduced in\n",
"[Reich et al (2009)](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2842210/).\n",
"See the documentation (link below) for the definition,\n",
"and [Peter (2016)](https://www.genetics.org/content/202/4/1485) for readable\n",
"discussion of their use.\n",
"\n",
"\n",
"(sec_stats_notes_y)=\n",
"\n",
"### Y statistics\n",
"\n",
"{meth}`~TreeSequence.Y3` and {meth}`~TreeSequence.Y2` are the `Y` statistics introduced\n",
"by [Ashander et al (2018)](https://www.biorxiv.org/content/10.1101/354530v1)\n",
"as a three-sample intermediate between diversity/divergence (which are\n",
"pairwise) and Patterson's f statistics (which are four-way).\n",
"\n",
"\n",
"(sec_stats_notes_derived)=\n",
"\n",
"### Derived statistics\n",
"\n",
"Most statistics have the property that `mode=\"branch\"` and\n",
"`mode=\"site\"` are \"dual\" in the sense that they are equal, on average, under\n",
"a high neutral mutation rate. {meth}`~TreeSequence.Fst` and {meth}`~TreeSequence.Tajimas_D`\n",
"do not have this property (since both are ratios of statistics that do have this property).\n",
"\n",
"\n",
"(sec_stats_notes_gnn)=\n",
"\n",
"### Genealogical nearest neighbours\n",
"\n",
"The {meth}`~TreeSequence.genealogical_nearest_neighbours` statistic is not based on branch\n",
"lengths, but on topologies. therefore it currently has a slightly different interface to\n",
"the other single site statistics. This may be revised in the future."
]
}
],
"metadata": {
"jupytext": {
"text_representation": {
"extension": ".md",
"format_name": "myst",
"format_version": 0.12,
"jupytext_version": "1.9.1"
}
},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.16"
},
"source_map": [
12,
98,
122,
126,
129,
133,
136,
140,
143,
147,
150,
154,
160,
230,
232,
270,
272,
325,
332,
365,
368
]
},
"nbformat": 4,
"nbformat_minor": 5
}