What's new in TSKIT 0.5.0!
TSKIT 0.5.0
New record number of contributors? (12!)
! git log 0.4.1..0.5.0 --format="%aN" --reverse | sort | uniq
Ben Jeffery
Bing Guo
Georgia Tsambos
Gertjan Bisschop
Jerome Kelleher
j.guez
Kevin R. Thornton
molpopgen
peter
Peter Ralph
Ruhollah Shemirani
Savita Karthikeyan
savitakartik
Shing Zhan
Yan Wong
Support and wheels for Python 3.10 (Ben Jeffery)
import sys
sys.version
'3.10.5 (main, Jun 11 2022, 16:53:24) [GCC 9.4.0]'
! pip install tskit --no-cache
Collecting tskit
Downloading tskit-0.5.0-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (1.1 MB)
[2K [38;2;114;156;31m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.1/1.1 MB[0m [31m12.2 MB/s[0m eta [36m0:00:00[0m MB/s[0m eta [36m0:00:01[0m01[0m
[?25hRequirement already satisfied: svgwrite>=1.1.10 in /home/benj/projects/050_notebook/env/lib/python3.10/site-packages (from tskit) (1.4.2)
Requirement already satisfied: numpy>=1.7 in /home/benj/projects/050_notebook/env/lib/python3.10/site-packages (from tskit) (1.23.0)
Requirement already satisfied: jsonschema>=3.0.0 in /home/benj/projects/050_notebook/env/lib/python3.10/site-packages (from tskit) (4.6.1)
Requirement already satisfied: attrs>=17.4.0 in /home/benj/projects/050_notebook/env/lib/python3.10/site-packages (from jsonschema>=3.0.0->tskit) (21.4.0)
Requirement already satisfied: pyrsistent!=0.17.0,!=0.17.1,!=0.17.2,>=0.14.0 in /home/benj/projects/050_notebook/env/lib/python3.10/site-packages (from jsonschema>=3.0.0->tskit) (0.18.1)
Installing collected packages: tskit
Successfully installed tskit-0.5.0
import tskit
print(tskit.__version__)
0.5.0
New suite of tree balance metrics, implemented in C - Jeremy Guez and Jerome Kelleher
#First make some trees to demo
comb = tskit.Tree.generate_comb(6)
from IPython.display import display, SVG
SVG(comb.draw_svg())
balanced = tskit.Tree.generate_balanced(6, arity=2)
SVG(balanced.draw_svg())
# The Sackin's index is the sum of the number of ancestors for each leaf of the tree.
# The less balanced a tree is the larger its Sackin index.
comb.sackin_index(), balanced.sackin_index()
(20, 16)
# The Colless index is the sum of abs(L-R) at each node of the tree where L(R) is the size
# of the left(right) descendant clade at the node.
# The less balanced a tree is the larger its Colless index.
comb.colless_index(), balanced.colless_index()
(10, 2)
# The classic B1 and B2 indexes from (Shao & Sokal, 1990)
comb.b1_index(), balanced.b1_index()
(2.083333333333333, 3.0)
comb.b2_index(), balanced.b2_index()
(0.5832456165989635, 0.7525749891599529)
The Tree
class now has arrays for: (Gertjan Bisschop)
- the number of children for each node
- the id of the edge from each node to its parent
# Note that these include an entry for the virtual root at the end of the array!
comb.num_children_array
array([0, 0, 0, 0, 0, 0, 2, 2, 2, 2, 2, 1], dtype=int32)
comb.edge_array
array([ 8, 6, 4, 2, 0, 1, 3, 5, 7, 9, -1, -1], dtype=int32)
# And there are methods to get a single value.
comb.edge(4)
0
# Only use these for clarity, don't use these in a tight loop!
big_comb = tskit.Tree.generate_comb(60000)
%timeit array = big_comb.num_children_array
81.6 ns ± 1.07 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
array = big_comb.num_children_array
%timeit val = array[34567]
95.1 ns ± 10.6 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
%timeit big_comb.num_children(34567)
170 ns ± 1.45 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
New convenience methods to get the time and populations of all individuals in a tree sequence (Peter Ralph)
import msprime
ts = msprime.sim_ancestry(10)
demography = msprime.Demography()
demography.add_population(name="A", initial_size=20)
demography.add_population(name="B", initial_size=5)
demography.add_population(name="C", initial_size=1)
demography.add_population_split(time=1000, derived=["A", "B"], ancestral="C")
ts = msprime.sim_ancestry(samples={"A": 5, "B": 5}, demography=demography, random_seed=12)
ts.individuals_time,ts.individuals_population
(array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]),
array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1], dtype=int32))
Three new topology operations have been added so that topology older than a given time can be removed (Jerome Kelleher)
ts = tskit.Tree.generate_balanced(3, branch_length=1).tree_sequence
SVG(ts.draw_svg(min_time=0, max_time=2))
SVG(ts.decapitate(0.5).draw_svg(min_time=0, max_time=2))
SVG(ts.split_edges(0.5).draw_svg())
# Decapitate is implemented via split_edges and delete_older
tables = ts.tables
tables.delete_older(0.5)
len(tables.edges)
0
Fix negative time handling in SVG drawing and add min_time
to draw_svg
(Yan Wong)
SVG(ts.draw_svg(min_time=-5)) # (Can also be "tree" or "ts")
Loading and dumping of tables and tree-sequences is now a zero-copy operation. This means you’ll be able to save and load larger files for your given RAM. It’s also faster! (Ben Jeffery)
(See images that the notebook wouldn’t let me embed due to CORS)
#### Edge diffs can now be iterated over in the reverse direction (Jerome Kelleher, Ben Jeffery)
ts = msprime.sim_ancestry(2, sequence_length=10,recombination_rate=1)
list(ts.edge_diffs())[0]
EdgeDiff(interval=Interval(left=0.0, right=1.0), edges_out=[], edges_in=[Edge(left=0.0, right=1.0, parent=9, child=0, metadata=b'', id=12), Edge(left=0.0, right=2.0, parent=9, child=3, metadata=b'', id=14), Edge(left=0.0, right=1.0, parent=17, child=1, metadata=b'', id=30), Edge(left=0.0, right=1.0, parent=17, child=2, metadata=b'', id=31), Edge(left=0.0, right=1.0, parent=22, child=9, metadata=b'', id=45), Edge(left=0.0, right=1.0, parent=22, child=17, metadata=b'', id=46)])
list(ts.edge_diffs(direction=tskit.REVERSE))[0]
EdgeDiff(interval=Interval(left=9.0, right=10.0), edges_out=[], edges_in=[Edge(left=8.0, right=10.0, parent=5, child=0, metadata=b'', id=2), Edge(left=8.0, right=10.0, parent=5, child=1, metadata=b'', id=3), Edge(left=9.0, right=10.0, parent=10, child=2, metadata=b'', id=15), Edge(left=8.0, right=10.0, parent=10, child=5, metadata=b'', id=16), Edge(left=9.0, right=10.0, parent=16, child=3, metadata=b'', id=28), Edge(left=9.0, right=10.0, parent=16, child=10, metadata=b'', id=29)])
Note that this change bought edge_diff
into Python, so there is a small ~10% perf hit. Please let us know if this has impacted you.
Instead of iterating over all variants when decoding genotypes, specific sites can be specified. This still uses tree seeking under the hood, so it’s more efficient to access sites in order. (Ben Jeffery)
ts = msprime.sim_ancestry(100, sequence_length=10000,recombination_rate=0.01,random_seed=42)
ts = msprime.sim_mutations(ts, rate=0.01,random_seed=42)
ts
Tree Sequence | |
---|---|
Trees | 1940 |
Sequence Length | 10000.0 |
Time Units | generations |
Sample Nodes | 200 |
Total Size | 509.9 KiB |
Metadata | No Metadata |
Table | Rows | Size | Has Metadata |
---|---|---|---|
Edges | 8097 | 253.0 KiB | |
Individuals | 100 | 2.8 KiB | |
Migrations | 0 | 8 Bytes | |
Mutations | 2399 | 86.7 KiB | |
Nodes | 1860 | 50.9 KiB | |
Populations | 1 | 224 Bytes | ✅ |
Provenances | 2 | 1.6 KiB | |
Sites | 2108 | 51.5 KiB |
v = tskit.Variant(ts)
v.genotypes
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Input In [30], in <cell line: 2>()
1 v = tskit.Variant(ts)
----> 2 v.genotypes
File ~/projects/050_notebook/env/lib/python3.10/site-packages/tskit/genotypes.py:156, in Variant.genotypes(self)
150 @property
151 def genotypes(self) -> np.ndarray:
152 """
153 An array of indexes into the list ``alleles``, giving the
154 state of each sample at the current site.
155 """
--> 156 self._check_decoded()
157 return self._ll_variant.genotypes
File ~/projects/050_notebook/env/lib/python3.10/site-packages/tskit/genotypes.py:120, in Variant._check_decoded(self)
118 def _check_decoded(self):
119 if self._ll_variant.site_id == tskit.NULL:
--> 120 raise ValueError(
121 "This variant has not yet been decoded at a specific site,"
122 "call Variant.decode to set the site."
123 )
ValueError: This variant has not yet been decoded at a specific site,call Variant.decode to set the site.
v.decode(5)
v.alleles, v.genotypes
(('A', 'G'),
array([0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0,
1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0,
0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0,
0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0,
0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0,
0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1,
1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0,
1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1,
0, 0], dtype=int32))
v.decode(1)
v.genotypes
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0], dtype=int32)
v.site
Site(id=1, position=4.0, ancestral_state='C', mutations=[Mutation(id=1, site=1, node=300, derived_state='T', parent=-1, metadata=b'', time=0.02619723874402521, edge=822)], metadata=b'')
ts.variants
has a new copy
argument which if False reuses the variant object for improved performance (Ben Jeffery)
%timeit [v.genotypes for v in ts.variants()]
4.54 ms ± 53.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit [v.genotypes for v in ts.variants(copy=False)]
2.76 ms ± 41.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
The Mutation
class now has an edge
attribute that stores the id of the edge the mutation falls on (Jerome Kelleher)
ts.edge(v.site.mutations[0].edge)
Edge(left=0.0, right=779.0, parent=494, child=300, metadata=b'', id=822)
VCF output now supports missingness, and when outputting a VCF sites and samples can be masked to be omitted from the output. VCFs can also be returned as a string with as_vcf
(Jerome Kelleher)
import numpy as np
print(ts.num_sites, "sites")
with open("out.vcf", "w") as f:
ts.write_vcf(f, site_mask=np.random.choice([0,1], size=ts.num_sites))
with open("out.vcf", "r") as f:
print(len(f.readlines()), "lines")
2108 sites
1053 lines
print(ts.as_vcf().split("\n")[5])
print(ts.as_vcf().split("\n")[10])
def pick_sites(variant):
return variant.genotypes == 1
print(ts.as_vcf(sample_mask=pick_sites).split("\n")[10])
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT tsk_0 tsk_1 tsk_2 tsk_3 tsk_4 tsk_5 tsk_6 tsk_7 tsk_8 tsk_9 tsk_10 tsk_11 tsk_12 tsk_13 tsk_14 tsk_15 tsk_16 tsk_17 tsk_18 tsk_19 tsk_20 tsk_21 tsk_22 tsk_23 tsk_24 tsk_25 tsk_26 tsk_27 tsk_28 tsk_29 tsk_30 tsk_31 tsk_32 tsk_33 tsk_34 tsk_35 tsk_36 tsk_37 tsk_38 tsk_39 tsk_40 tsk_41 tsk_42 tsk_43 tsk_44 tsk_45 tsk_46 tsk_47 tsk_48 tsk_49 tsk_50 tsk_51 tsk_52 tsk_53 tsk_54 tsk_55 tsk_56 tsk_57 tsk_58 tsk_59 tsk_60 tsk_61 tsk_62 tsk_63 tsk_64 tsk_65 tsk_66 tsk_67 tsk_68 tsk_69 tsk_70 tsk_71 tsk_72 tsk_73 tsk_74 tsk_75 tsk_76 tsk_77 tsk_78 tsk_79 tsk_80 tsk_81 tsk_82 tsk_83 tsk_84 tsk_85 tsk_86 tsk_87 tsk_88 tsk_89 tsk_90 tsk_91 tsk_92 tsk_93 tsk_94 tsk_95 tsk_96 tsk_97 tsk_98 tsk_99
1 14 4 A T . PASS . GT 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 1|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 1|0 1|0 0|0 0|0 0|0 0|0 0|0 1|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 1|0 0|0 0|0 0|0 0|0 0|0 1|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|1 0|0 0|0
1 14 4 A T . PASS . GT 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 .|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 .|0 .|0 0|0 0|0 0|0 0|0 0|0 .|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 .|0 0|0 0|0 0|0 0|0 0|0 .|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|. 0|0 0|0
VCF output now includes the site ID in the VCF ID field (Ruhollah (Roohy) Shemirani)
When metadata equal to b''
is printed to text or HTML tables it will render as an empty string rather than “b’’” (Yan Wong)
tskit.set_print_options(max_lines=10)
ts.tables.nodes
<div>
<style scoped="">
.tskit-table tbody tr th:only-of-type {vertical-align: middle;}
.tskit-table tbody tr th {vertical-align: top;}
.tskit-table tbody td {text-align: right;padding: 0.5em 0.5em;}
.tskit-table tbody th {padding: 0.5em 0.5em;}
</style>
<table border="1" class="tskit-table">
<thead>
<tr>
<th>id</th><th>flags</th><th>population</th><th>individual</th><th>time</th><th>metadata</th>
</tr>
</thead>
<tbody>
<tr><td>0</td><td>1</td><td>0</td><td>0</td><td>0.00000000</td><td></td></tr>
</tbody>
</table>
</div>
tree.mrca
now gives the most common ancestor of any number of nodes (Savita Karthikeyan)
ts.first().mrca(23, 56, 69, 89)
1556
ts.site
now has a position
argument, implemented by binary search (Shing Hei Zhan)
ts.site(position=420)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Input In [43], in <cell line: 1>()
----> 1 ts.site(position=420)
File ~/projects/050_notebook/env/lib/python3.10/site-packages/tskit/trees.py:5409, in TreeSequence.site(self, id_, position)
5407 id_ = site_pos.searchsorted(position)
5408 if id_ >= len(site_pos) or site_pos[id_] != position:
-> 5409 raise ValueError(f"There is no site at position {position}.")
5410 ll_site = self._ll_tree_sequence.get_site(id_)
5411 pos, ancestral_state, ll_mutations, _, metadata = ll_site
ValueError: There is no site at position 420.
ts.site(position=14)
Site(id=4, position=14.0, ancestral_state='A', mutations=[Mutation(id=4, site=4, node=552, derived_state='T', parent=-1, metadata=b'', time=0.043147719803472, edge=2315)], metadata=b'')