Tskit and R
Tskit and R#
To interface with tskit
in R, we can use the reticulate R package, which lets you call Python functions within an R session. In this short tutorial, we’ll go through a couple of examples to show you how to get started. If you haven’t done so already, you’ll need to install reticulate
in your R session via install.packages("reticulate")
.
We’ll begin by simulating a small tree sequence using msprime
.
msprime <- reticulate::import("msprime")
ts <- msprime$sim_ancestry(10, sequence_length = 100, random_seed=42)
ts
╔═══════════════════════════╗
║TreeSequence ║
╠═══════════════╤═══════════╣
║Trees │ 1║
╟───────────────┼───────────╢
║Sequence Length│ 100║
╟───────────────┼───────────╢
║Time Units │generations║
╟───────────────┼───────────╢
║Sample Nodes │ 20║
╟───────────────┼───────────╢
║Total Size │ 4.1 KiB║
╚═══════════════╧═══════════╝
╔═══════════╤════╤═════════╤════════════╗
║Table │Rows│Size │Has Metadata║
╠═══════════╪════╪═════════╪════════════╣
║Edges │ 38│ 1.2 KiB│ No║
╟───────────┼────┼─────────┼────────────╢
║Individuals│ 10│304 Bytes│ No║
╟───────────┼────┼─────────┼────────────╢
║Migrations │ 0│ 8 Bytes│ No║
╟───────────┼────┼─────────┼────────────╢
║Mutations │ 0│ 16 Bytes│ No║
╟───────────┼────┼─────────┼────────────╢
║Nodes │ 39│ 1.1 KiB│ No║
╟───────────┼────┼─────────┼────────────╢
║Populations│ 1│224 Bytes│ Yes║
╟───────────┼────┼─────────┼────────────╢
║Provenances│ 1│950 Bytes│ No║
╟───────────┼────┼─────────┼────────────╢
║Sites │ 0│ 16 Bytes│ No║
╚═══════════╧════╧═════════╧════════════╝
reticulate
allows us to access a Python object’s attributes or call its methods via the $
operator. For example, we can access (and assign to a variable) the number of samples in the tree sequence:
n <- ts$num_samples
n
We can also use tskit
’s powerful Statistics framework to efficiently compute many different summary statistics from a tree sequence. To illustrate this, we’ll first add some mutations to our tree sequence with msprime
’s sim_mutations
function, and then compute the genetic diversity for each of the tree sequence’s sample nodes:
ts_mut = msprime$sim_mutations(ts, rate = 1e-2, random_seed = 42)
ts_mut$num_mutations
ts_mut$diversity()
As a final example, we can also use the tree sequence genotype_matrix()
method to return the genotypes of the the tree sequence as a matrix object in R.
G = ts_mut$genotype_matrix()
G
0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 |
1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 0 |
0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
We can then use R functions directly on the genotype matrix:
allele_frequency = rowMeans(G)
allele_frequency
- 0.05
- 0.95
- 0.45
- 0.9
- 0.65
- 0.2
- 0.95
- 0.95
- 0.05
- 0.05
- 1
- 0.05
- 0.05
It’s as simple as that! Be sure to check out the reticulate documentation, in particular on Calling Python from R, which includes important information on how R data types are converted to their equivalent Python types.