What's new in TSKIT 0.4.0!
A quick rundown of what’s coming in 0.4.0, full changelog with links to issues at https://github.com/tskit-dev/tskit/releases/tag/0.4.0
import tskit
import msprime
import numpy as np
tskit.__version__
'0.4.0'
Breaking changes
- In a small but useful change the CLI
info
command now gives more detailed information on the tree sequence (benjeffery)
!tskit info tests/data/SLiM/single-locus-example.trees
╔═══════════════════════╗
║TreeSequence ║
╠═══════════════╤═══════╣
║Trees │ 1║
╟───────────────┼───────╢
║Sequence Length│ 10║
╟───────────────┼───────╢
║Time Units │unknown║
╟───────────────┼───────╢
║Sample Nodes │ 10║
╟───────────────┼───────╢
║Total Size │1.8 KiB║
╚═══════════════╧═══════╝
╔═══════════╤════╤═════════╤════════════╗
║Table │Rows│Size │Has Metadata║
╠═══════════╪════╪═════════╪════════════╣
║Edges │ 12│392 Bytes│ No║
╟───────────┼────┼─────────┼────────────╢
║Individuals│ 5│404 Bytes│ Yes║
╟───────────┼────┼─────────┼────────────╢
║Migrations │ 0│ 8 Bytes│ No║
╟───────────┼────┼─────────┼────────────╢
║Mutations │ 0│ 16 Bytes│ No║
╟───────────┼────┼─────────┼────────────╢
║Nodes │ 15│578 Bytes│ Yes║
╟───────────┼────┼─────────┼────────────╢
║Populations│ 2│112 Bytes│ Yes║
╟───────────┼────┼─────────┼────────────╢
║Provenances│ 1│176 Bytes│ No║
╟───────────┼────┼─────────┼────────────╢
║Sites │ 0│ 16 Bytes│ No║
╚═══════════╧════╧═════════╧════════════╝
# Lets make a tree sequence to use in subsequent examples
ts = msprime.sim_ancestry(20, sequence_length=10)
ts = msprime.sim_mutations(ts,rate=1)
tc = ts.tables
ts
Tree Sequence | |
---|---|
Trees | 1 |
Sequence Length | 10.0 |
Time Units | unknown |
Sample Nodes | 40 |
Total Size | 14.4 KiB |
Metadata | No Metadata |
Table | Rows | Size | Has Metadata |
---|---|---|---|
Edges | 78 | 2.4 KiB | |
Individuals | 20 | 584 Bytes | |
Migrations | 0 | 8 Bytes | |
Mutations | 180 | 6.5 KiB | |
Nodes | 79 | 2.2 KiB | |
Populations | 1 | 224 Bytes | ✅ |
Provenances | 2 | 1.6 KiB | |
Sites | 10 | 266 Bytes |
- In a big change 64 bits are now used to store the sizes of ragged table columns such as metadata,
allowing them to hold more data. This change is fully backwards and forwards compatible
for all tree-sequences whose ragged column sizes fit into 32 bits. New tree-sequences with
large offset arrays that require 64 bits will fail to load in previous versions with
error
_tskit.FileFormatError: An incompatible type for a column was found in the file
. (jeromekelleher, benjeffery)
# This means the dtypes for offset columns has changed
tc.nodes.metadata_offset.dtype
dtype('uint64')
- To make tree traversals in the presence of multiple roots easier, the Tree class now conceptually has an extra node, the “virtual root” whose children are the roots of the tree. The quintuply linked tree arrays (parent_array, left_child_array, right_child_array, left_sib_array and right_sib_array) all have one extra element. (jeromekelleher)
# Get a tree to look at
t = ts.first()
t
Tree | |
---|---|
Index | 0 |
Interval | 0-10 (10) |
Roots | 1 |
Nodes | 79 |
Sites | 10 |
Mutations | 180 |
Total Branch Length | 17.525159 |
#How many usual nodes are in the tree
len(t.preorder())
79
#The quintuply linked arrays have an extra element - the virtual root
len(t.parent_array)
80
#The normal root is the penultimate node in those arrays:
t.root
78
#And it's parent is NULL as before
t.parent_array[t.root]
-1
#The virtual root is the last element
t.virtual_root
79
# And its children are the normal roots
t.left_child_array[t.virtual_root]
78
-
Tree traversal orders returned by the
nodes
method have changed when there are multiple roots. Previously orders were defined locally for each root, but are now globally across all roots. (jeromekelleher) -
Individuals are no longer guaranteed or required to be topologically sorted in a tree sequence.
TableCollection.sort
no longer sorts individuals.(benjeffery) -
Metadata encoding errors now raise
MetadataEncodingError
# Encode some bad metadata and see that we get a nice, helpful error now.
tskit.metadata.JSONCodec({"codec":"json"}).encode({"a":np.array([])})
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
~/projects/tskit/python/tskit/metadata.py in encode(self, obj)
158 try:
--> 159 return tskit.canonical_json(obj).encode()
160 except TypeError as e:
~/projects/tskit/python/tskit/util.py in canonical_json(obj)
57 """
---> 58 return json.dumps(obj, sort_keys=True, separators=(",", ":"))
59
~/miniconda3/lib/python3.9/json/__init__.py in dumps(obj, skipkeys, ensure_ascii, check_circular, allow_nan, cls, indent, separators, default, sort_keys, **kw)
233 cls = JSONEncoder
--> 234 return cls(
235 skipkeys=skipkeys, ensure_ascii=ensure_ascii,
~/miniconda3/lib/python3.9/json/encoder.py in encode(self, o)
198 # equivalent to the PySequence_Fast that ''.join() would do.
--> 199 chunks = self.iterencode(o, _one_shot=True)
200 if not isinstance(chunks, (list, tuple)):
~/miniconda3/lib/python3.9/json/encoder.py in iterencode(self, o, _one_shot)
256 self.skipkeys, _one_shot)
--> 257 return _iterencode(o, 0)
258
~/miniconda3/lib/python3.9/json/encoder.py in default(self, o)
178 """
--> 179 raise TypeError(f'Object of type {o.__class__.__name__} '
180 f'is not JSON serializable')
TypeError: Object of type ndarray is not JSON serializable
During handling of the above exception, another exception occurred:
MetadataEncodingError Traceback (most recent call last)
/tmp/ipykernel_2089335/467480944.py in <module>
1 # Encode some bad metadata and see that we get a nice, helpful error now.
----> 2 tskit.metadata.JSONCodec({"codec":"json"}).encode({"a":np.array([])})
~/projects/tskit/python/tskit/metadata.py in encode(self, obj)
159 return tskit.canonical_json(obj).encode()
160 except TypeError as e:
--> 161 raise exceptions.MetadataEncodingError(
162 f"Could not encode metadata of type {str(e).split()[3]}"
163 )
MetadataEncodingError: Could not encode metadata of type ndarray
-
Change default value for
missing_data_char
in theTreeSequence.haplotypes
method from “-“ to “N”. This is a more idiomatic usage to indicate missing data rather than a gap in an alignment. (jeromekelleher) -
Allow skipping of site and mutation tables in
TableCollection.sort
(benjeffery)
# To skip sorting, specifiy to start sorting at the end.
tc.sort(site_start=ts.num_sites, mutation_start=ts.num_mutations)
- Add
TableCollection.sort_individuals
to sort the individuals as this is no longer done by the default sort. (benjeffery)
# Method is similar to other sorts
tc.sort_individuals()
- Add
__setitem__
to all tables allowing single rows to be updated. Ragged arrays are not reallocated if the size of the row is unchanged. (jeromekelleher,benjeffery)
# Lets replace this row for one with metadata
tc.nodes[0]
NodeTableRow(flags=1, time=0.0, population=0, individual=0, metadata=b'')
#Firstly set a schema
tc.nodes.metadata_schema = tskit.metadata.MetadataSchema({"codec":"json"})
#Then we can simply assign a row with replaced metadata
tc.nodes[0] = tc.nodes[0].replace(metadata={"foo":"bar"})
#If you want to replace an attribute on all rows - this way is faster:
copy=tc.nodes.copy()
tc.nodes.clear()
for n in copy:
tc.nodes.append(n.replace(metadata={"foo":"bar"}))
- Added a new parameter
time
toTreeSequence.samples()
allowing to select samples at a specific time point or time interval. (mufernando, petrelharp)
ts.samples()
array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
34, 35, 36, 37, 38, 39], dtype=int32)
#Specify time as an interval
ts.samples(time=(0.1,0.2))
array([], dtype=int32)
- Add
table.metadata_vector
to all table classes to allow easy extraction of a single metadata key into an array (petrelharp).
#old way md = [n.metadata["foo"] for n in tc.nodes]
tc.nodes.metadata_vector(key=["foo"], default_value="wtf")
array(['bar', 'bar', 'bar', 'bar', 'bar', 'bar', 'bar', 'bar', 'bar',
'bar', 'bar', 'bar', 'bar', 'bar', 'bar', 'bar', 'bar', 'bar',
'bar', 'bar', 'bar', 'bar', 'bar', 'bar', 'bar', 'bar', 'bar',
'bar', 'bar', 'bar', 'bar', 'bar', 'bar', 'bar', 'bar', 'bar',
'bar', 'bar', 'bar', 'bar', 'bar', 'bar', 'bar', 'bar', 'bar',
'bar', 'bar', 'bar', 'bar', 'bar', 'bar', 'bar', 'bar', 'bar',
'bar', 'bar', 'bar', 'bar', 'bar', 'bar', 'bar', 'bar', 'bar',
'bar', 'bar', 'bar', 'bar', 'bar', 'bar', 'bar', 'bar', 'bar',
'bar', 'bar', 'bar', 'bar', 'bar', 'bar', 'bar'], dtype='<U3')
- Add
time_units
toTreeSequence
to describe the units of the time dimension of the tree sequence. This is then used to generate an error iftime_units
isuncalibrated
when using the branch lengths in statistics. (benjeffery)
tc.time_units
'unknown'
#Make a tree sequence where the time dimension is uncalibrated
tc.time_units=tskit.TIME_UNITS_UNCALIBRATED
tc.time_units
ts = tc.tree_sequence()
#An error gets thrown if you try to do a statistic that makes no sense given the dimension
ts.allele_frequency_spectrum(mode="branch")
---------------------------------------------------------------------------
LibraryError Traceback (most recent call last)
/tmp/ipykernel_2089335/3925147128.py in <module>
1 #An error gets thrown if you try to do a statistic that makes no sense given the dimension
----> 2 ts.allele_frequency_spectrum(mode="branch")
~/projects/tskit/python/tskit/trees.py in allele_frequency_spectrum(self, sample_sets, windows, mode, span_normalise, polarised)
7214 if sample_sets is None:
7215 sample_sets = [self.samples()]
-> 7216 return self.__one_way_sample_set_stat(
7217 self._ll_tree_sequence.allele_frequency_spectrum,
7218 sample_sets,
~/projects/tskit/python/tskit/trees.py in __one_way_sample_set_stat(self, ll_method, sample_sets, windows, mode, span_normalise, polarised)
6499
6500 flattened = util.safe_np_int_cast(np.hstack(sample_sets), np.int32)
-> 6501 stat = self.__run_windowed_stat(
6502 windows,
6503 ll_method,
~/projects/tskit/python/tskit/trees.py in __run_windowed_stat(self, windows, method, *args, **kwargs)
6461 strip_dim = windows is None
6462 windows = self.parse_windows(windows)
-> 6463 stat = method(*args, **kwargs, windows=windows)
6464 if strip_dim:
6465 stat = stat[0]
LibraryError: Statistics using branch lengths cannot be calculated when time_units is 'uncalibrated'
-
Improved performance for tree traversal methods in the
nodes
iterator. Roughly a 10X performance increase for “preorder”, “postorder”, “timeasc” and “timedesc” (jeromekelleher) -
Substantial performance improvement for
Tree.total_branch_length
(jeromekelleher) -
Add the
discrete_genome
property to the TreeSequence class which is true if all coordinates are discrete (jeromekelleher)
# For our example all the positions are integers
ts.discrete_genome
True
- Add the
discrete_time
property to the TreeSequence class which is true if all time coordinates are discrete or unknown (benjeffery)
ts.discrete_time
False
- Add the
TreeSequence.alignments
method. Uses the reference sequence if there is one or one can be specifed.(jeromekelleher)
list(ts.alignments())
list(ts.alignments(reference_sequence=tskit.random_nucleotides(ts.sequence_length)))
['GAGTCCTGCA',
'GAAGCCAATG',
'AAGCACATGA',
'GTCTCGTCTA',
'GAGTCATACT',
'GAGTCATACT',
'CATGTGATAA',
'TTATCGCACT',
'CCCGTGCACC',
'AAACTCCTGT',
'TAAAACATGA',
'TTCTCAATCG',
'GAATCGAATA',
'TTATCGGACG',
'GAAGCTAGCC',
'TTATCGCAAT',
'AAGTACATGA',
'TTATCGGACG',
'TTATCGGACG',
'TTCTCAATCG',
'GAACCTAGTC',
'TGGGAATTCG',
'TACAACATTA',
'GAAGCCAATC',
'AAGTACATGA',
'CGCGCCCGCC',
'ACTTTAATGC',
'GAACCGAGTC',
'GAACCGAGTC',
'TTATCGCAAT',
'TTATCGCACC',
'ACCTACATGC',
'GAACCGAGTC',
'CCCGTGTACC',
'GTATCGCACT',
'AAGTACATGA',
'ACATACATGC',
'ATGTCGTATA',
'GAGACATACC',
'TACTACATTA']
Reference sequence
You can now store a reference in the tree sequence! (jeromekelleher, benjeffery) This API is preliminary, we intend to add ways to fetch from URLs or files to avoid duplication.
tc.reference_sequence.data = "ATCG"
tc.reference_sequence.metadata_schema = tskit.MetadataSchema({"codec":"json"})
tc.reference_sequence.metadata = {"accession":"someid", "other":"stuff"}
tc.reference_sequence.data
'ATCG'
tc.reference_sequence.metadata
{'accession': 'someid', 'other': 'stuff'}