Tskit and R

To interface with tskit in R, we can use the reticulate R package, which lets you call Python functions within an R session. In this short tutorial, we’ll go through a couple of examples to show you how to get started. If you haven’t done so already, you’ll need to install reticulate in your R session via install.packages("reticulate").

We’ll begin by simulating a small tree sequence using msprime.

msprime <- reticulate::import("msprime")

ts <- msprime$sim_ancestry(10, sequence_length = 100, random_seed=42)
ts
╔═══════════════════════╗
║TreeSequence           ║
╠═══════════════╤═══════╣
║Trees          │      1║
╟───────────────┼───────╢
║Sequence Length│  100.0║
╟───────────────┼───────╢
║Sample Nodes   │     20║
╟───────────────┼───────╢
║Total Size     │3.6 KiB║
╚═══════════════╧═══════╝
╔═══════════╤════╤═════════╤════════════╗
║Table      │Rows│Size     │Has Metadata║
╠═══════════╪════╪═════════╪════════════╣
║Edges      │  38│  1.0 KiB│          No║
╟───────────┼────┼─────────┼────────────╢
║Individuals│  10│172 Bytes│          No║
╟───────────┼────┼─────────┼────────────╢
║Migrations │   0│  4 Bytes│          No║
╟───────────┼────┼─────────┼────────────╢
║Mutations  │   0│  8 Bytes│          No║
╟───────────┼────┼─────────┼────────────╢
║Nodes      │  39│940 Bytes│          No║
╟───────────┼────┼─────────┼────────────╢
║Populations│   1│216 Bytes│         Yes║
╟───────────┼────┼─────────┼────────────╢
║Provenances│   1│960 Bytes│          No║
╟───────────┼────┼─────────┼────────────╢
║Sites      │   0│  8 Bytes│          No║
╚═══════════╧════╧═════════╧════════════╝

reticulate allows us to access a Python object’s attributes or call its methods via the $ operator. For example, we can access (and assign to a variable) the number of samples in the tree sequence:

n <- ts$num_samples
n
20

We can also use tskit’s powerful Statistics framework to efficiently compute many different summary statistics from a tree sequence. To illustrate this, we’ll first add some mutations to our tree sequence with msprime’s sim_mutations function, and then compute the genetic diversity for each of the tree sequence’s sample nodes:

ts_mut = msprime$sim_mutations(ts, rate = 1e-2, random_seed = 42)
ts_mut$num_mutations
ts_mut$diversity()
14
0.0232631578947368

As a final example, we can also use the tree sequence genotype_matrix() method to return the genotypes of the the tree sequence as a matrix object in R.

G = ts_mut$genotype_matrix()
G
A matrix: 13 × 20 of type int
00010000000000000000
11101111111111111111
11101000001100110010
11101111101111111111
11101111001101110010
00000111000001000000
11101111111111111111
11101111111111111111
00010000000000000000
00000000000000010000
11111111111111111111
00000000000001000000
00010000000000000000

We can then use R functions directly on the genotype matrix:

allele_frequency = rowMeans(G)
allele_frequency
  1. 0.05
  2. 0.95
  3. 0.45
  4. 0.9
  5. 0.65
  6. 0.2
  7. 0.95
  8. 0.95
  9. 0.05
  10. 0.05
  11. 1
  12. 0.05
  13. 0.05

It’s as simple as that! Be sure to check out the reticulate documentation, in particular on Calling Python from R, which includes important information on how R data types are converted to their equivalent Python types.