API Documentation#
Variant data#
- class tsinfer.VariantData(path_or_zarr, ancestral_state, *, sample_mask=None, site_mask=None, sites_time=None, individuals_time=None, individuals_location=None, individuals_population=None, individuals_flags=None, sequence_length=None)[source]#
Class representing input variant data used for inference. This is mostly a thin wrapper for a Zarr dataset storing information in the VCF Zarr (.vcz) format, plus information specifing the ancestral allele and (optional) data masks. It then provides various derived properties and methods for accessing the data in a form suitable for inference.
Note
In the VariantData object, “samples” refer to the individuals in the dataset, each of which can be of arbitrary ploidy. This is in contrast to
tskit
, in which each haploid genome is treated as a separate “sample”. For example in a diploid dataset, the inferred tree sequence returned at the end of the inference process will haveinferred_ts.num_samples
equal to double the number returned byVariantData.num_samples
.- Parameters:
path_or_zarr (Union(str, zarr.Group)) – The input dataset in VCF Zarr format. This can either a path to the Zarr dataset saved on disk, or the Zarr object itself.
ancestral_state (Union(array, str)) – A numpy array of strings specifying the ancestral states (alleles) used in inference. This must be the same length as the number of unmasked sites in the dataset. Alternatively, a single string can be provided, giving the name of an array in the input dataset which contains the ancestral states. Unknown ancestral states can be specified using “N”. Any ancestral states which do not match any of the known alleles at that site, will be tallied, and a warning issued summarizing the unknown ancestral states. Note that allelic states are case-sensitive in tsinfer.
sample_mask (Union(array, str)) – A numpy array of booleans specifying which samples to mask out (exclude) from the dataset. Alternatively, a string can be provided, giving the name of an array in the input dataset which contains the sample mask. If
None
(default), all samples are included.site_mask (Union(array, str)) – A numpy array of booleans specifying which sites to mask out (exclude) from the dataset. Alternatively, a string can be provided, giving the name of an array in the input dataset which contains the site mask. If
None
(default), all sites are included.sites_time (Union(array, str)) – A numpy array of floats specifying the relative time of occurrence of the mutation to the derived state at each site. This must be of the same length as the number of unmasked sites. Alternatively, a string can be provided, giving the name of an array in the input dataset which contains the site times. If
None
(default), the frequency of the derived allele is used as a proxy for the time of occurrence: this is usually a reasonable approximation to the relative order of ancestors used for inference. Time values are ignored for sites not used in inference, such as singletons, sites with more than two alleles, or sites with an unknown ancestral state.individuals_time (Union(array, str)) – A numpy array of floats specifying the time of each individual in the dataset. This must be the same length as the number of unmasked individuals. Alternatively, a string can be provided, giving the name of an array in the input dataset which contains the individual times. If
None
(default), individuals are assumed to havetskit.UNKNOWN_TIME
.individuals_location (Union(array, str)) – A numpy array specifying the location of each individual in the dataset. This must be the same length as the number of unmasked individuals. Alternatively, a string can be provided, giving the name of an array in the input dataset which contains the individual locations. If
None
(default), individuals are assumed to have empty location arrays.individuals_population (Union(array, str)) – A numpy array of integers specifying the population of each individual in the dataset. This must be the same length as the number of unmasked individuals. Alternatively, a string can be provided, giving the name of an array in the input dataset which contains the individual populations. If
None
(default), individuals are assumed to havetskit.NULL
as their population.individuals_flags (Union(array, str)) – A numpy array of integers specifying the flags of each individual in the dataset. This must be the same length as the number of unmasked individuals. Alternatively, a string can be provided, giving the name of an array in the input dataset which contains the individual flags. If
None
(default), individuals are assumed to have flags set to 0.sequence_length (int) – An integer specifying the resulting sequence_length attribute of the output tree sequence. If not specified the contig_length attribute from the undelying zarr store for the contig of the selected variants. is used. If that is not present then the maximum position plus one of the used variants is used.
- tsinfer.add_ancestral_state_array(zarr_group, fasta_string, *, array_name='ancestral_state')[source]#
Add an ancestral state array to a zarr group from a string of nucleotides representing the ancestral sequence.
- Parameters:
zarr_group (zarr.Group) – A zarr group to add the ancestral state array to. This should be a VCF Zarr format group containing a “variant_position” array.
fasta_string (str) – A string containing the ancestral sequence, from e.g. FASTA.
array_name (str) – The name of the array to create in the zarr group. Default is “ancestral_state”.
- Returns:
The newly inserted ancestral state array.
- Return type:
Ancestor data#
- tsinfer.load(path)[source]#
Loads a tsinfer
AncestorData
file from the specified path. If the file is format not recognised aFileFormatError
will be thrown.- Parameters:
path (str) – The path of the file we wish to load.
- Returns:
The corresponding
AncestorData
instance opened in read only mode.- Return type:
- Raises:
FileFormatError
if the file cannot be read.
- class tsinfer.AncestorData(position, sequence_length, *, path=None, num_flush_threads=0, compressor=None, chunk_size=1024, max_file_size=None)[source]#
Class representing the stored ancestor data produced by
generate_ancestors()
. See the ancestor data file format specifications for details on the structure of this file.Note
If a
path
is specified, themax_file_size
option puts an upper limit on the possible size of the created file. On non-Windows systems, space for this file is not immediately allocated but just “reserved” using sparse file systems. However, on Windows systems the file is allocated immediately, somax_file_size
takes a smaller default value, to avoid allocating very large files for no reason. Users who wish to run large inferences on Windows may therefore need to explictly set an appropriatemax_file_size
. Note that themax_file_size
is only used while the file is being built: one the file has been finalised, it is shrunk to its minimum size.- Parameters:
position (arraylike) – Integer array of the site positions of the ancestors. All values should be >0 and the array should be monotonically increasing.
sequence_length (float) – Total length of the sequence, site positions must be less than this value.
path (str) – The path of the file to store the ancestor data. If None, the information is stored in memory and not persistent.
num_flush_threads (int) – The number of background threads to use for compressing data and flushing to disc. If <= 0, do not spawn any threads but use a synchronous algorithm instead. Default=0.
compressor (numcodecs.abc.Codec) – A
numcodecs.abc.Codec
instance to use for compressing data. Any codec may be used, but problems may occur with very large datasets on certain codecs as they cannot compress buffers >2GB. If None, do not use any compression. Default=:class:numcodecs.zstd.Zstd.chunk_size (int) – The chunk size used for zarr arrays in the sample dimension. This affects compression level and algorithm performance. Default=1024.
chunk_size_sites (int) –
The chunk size used for the genotype zarr arrays in the sites dimension. This affects compression level and algorithm performance. Default=16384.
max_file_size (int) – If a file is being used to store this data, set a maximum size in bytes for the stored file. If None, the default value of 1GiB is used on Windows and 1TiB on other platforms (see above for details).
- data_equal(other)[source]#
Returns True if all the data attributes of this input file and the specified input file are equal. This compares every attribute.
- property sequence_length#
Returns the sequence length.
- property num_sites#
The number of inference sites used to generate the ancestors
- property sites_position#
The positions of the inference sites used to generate the ancestors
- add_provenance(timestamp, record)#
Adds a new provenance record with the specified timestamp and record. Timestamps should ISO8601 formatted, and record is some JSON encodable object.
- arrays()#
Returns a list of all the zarr arrays in this DataContainer.
- clear_provenances()#
Clear all provenances in this instance
- close()#
Close this DataContainer. Any read or write operations attempted after calling this will fail.
- copy(path=None, max_file_size=None)#
Returns a copy of this DataContainer opened in ‘edit’ mode. If path is specified, this must not be equal to the path of the current data container.
- property file_size#
Returns the size of the underlying file, or -1 if we do not have a file associated.
- property info#
Returns a string containing the zarr info for each array.
- provenances()#
Returns an iterator over the (timestamp, record) pairs representing the provenances for this data container.
- record_provenance(command=None, resources=None, **kwargs)#
Records the provenance information for this file using the tskit provenances schema.
- property ancestors_length#
Returns the length of ancestors in physical coordinates.
- insert_proxy_samples(variant_data, *, sample_ids=None, epsilon=None, allow_mutation=False, **kwargs)[source]#
Take a set of samples from a
VariantData
instance and create additional “proxy sample ancestors” from them, returning a newAncestorData
instance including both the current ancestors and the additional ancestors at the appropriate time points.A proxy sample ancestor is an ancestor based upon a known sample. At sites used in the full inference process, the haplotype of this ancestor is identical to that of the sample on which it is based. The time of the ancestor is taken to be a fraction
epsilon
older than the sample on which it is based.A common use of this function is to provide ancestral nodes for anchoring historical samples at the correct time when matching them into a tree sequence during the
tsinfer.match_samples()
stage of inference. For this reason, by default, the samples chosen fromsample_data
are those associated with historical (i.e. non-contemporary) individuals. This can be altered by using thesample_ids
parameter.Note
The proxy sample ancestors inserted here will correspond to extra nodes in the inferred tree sequence. At sites which are not used in the full inference process (e.g. sites unique to a single historical sample), these proxy sample ancestor nodes may have a different genotype from their corresponding sample.
- Parameters:
variant_data (VariantData) – The VariantData instance from which to select the samples used to create extra ancestors.
sample_ids (list(int)) – A list of sample ids in the
variant_data
instance that will be selected to create the extra ancestors. IfNone
(default) select all the historical samples, i.e. those associated with an Individual whose time is greater than zero. The order of ids is ignored, as are duplicate ids.epsilon (list(float)) – An list of small time increments determining how much older each proxy sample ancestor is than the corresponding sample listed in
sample_ids
. A single value is also allowed, in which case it is used as the time increment for all selected proxy sample ancestors. If None (default) find \({\delta}t\), the smallest time difference between between the sample times and the next oldest ancestor in the currentAncestorData
instance, settingepsilon
= \({\delta}t / 100\) (or, if all selected samples are at least as old as the oldest ancestor, take \({\delta}t\) to be the smallest non-zero time difference between existing ancestors).allow_mutation (bool) – If
False
(the default), any site in a proxy sample ancestor that has a derived allele must have a pre-existing mutation in an older (non-proxy) ancestor, otherwise an error is raised. Alternatively, ifallow_mutation
isTrue
, proxy ancestors can contain a de-novo mutation at a site that also has a mutation elsewhere (i.e. breaking the infinite sites assumption), allowing them to possess derived alleles at sites where there are no pre-existing mutations in older ancestors.**kwargs – Further arguments passed to the constructor when creating the new
AncestorData
instance which will be returned.
- Returns:
A new
AncestorData
object.- Return type:
- truncate_ancestors(lower_time_bound, upper_time_bound, length_multiplier=2, buffer_length=1000, **kwargs)[source]#
Truncates the length of ancestors above a given time and returns a new
AncestorData
instance.Given a set of haplotypes H such that
lower_time_bound
<=h.time
<upper_time_bound
, we letmax_len = length_multiplier * max(max(h.length) for h in H)
. Then, we truncate all haplotypes containing at least one focal site whereh.time >= upper
, ensuring these haplotypes extend no further than half ofmax_len
to the either side of the leftmost and rightmost focal sites of the ancestral haplotype. Note that ancestors aboveupper_time_bound
may still be longer thanmax_len
if the ancestor contains greater than 2 focal sites.This function should be used when
tsinfer.generate_ancestors()
generates old ancestors which are very long, as these can significantly slow down matching time. Older ancestors should generally be shorter than younger ancestors, so truncating the lengths of older ancestors has negligible effect on inference accuracy.Note
Please ensure that the time values provided to
lower_time_bound
andupper_time_bound
match the units used in theAncestorData
file, i.e. if your ancestors do not have site times specified,upper_time_bound
should be between 0 and 1.- Parameters:
lower_time_bound (float) – Defines the lower bound (inclusive) of the half open interval where we search for a truncation value.
upper_time_bound (float) – Defines the upper bound (exclusive) of the half open interval where we search for a truncation value. The truncation value is the length of the longest haplotype in this interval multiplied by
length_multiplier
. The length of ancestors as old or older thanupper_time_bound
will be truncated using this value.length_multiplier (float) – A multiplier for the length of the longest ancestor in the half-open interval between
lower_time_bound
(inclusive) anduppper_time_bound
(exclusive), i.e. if the longest ancestor in the interval is 1 megabase, alength_multiplier
of 2 creates a maximum length of 2 megabases.buffer_length (int) – The number of changed ancestors to buffer before writing to disk.
**kwargs – Further arguments passed to the
AncestorData.copy()
when creating the newAncestorData
instance which will be returned.
- Returns:
A new
AncestorData
object.- Return type:
- add_ancestor(start, end, time, focal_sites, haplotype)[source]#
Adds an ancestor with the specified haplotype, with ancestral material over the interval [start:end], that is associated with the specified timepoint and has new mutations at the specified list of focal sites. Ancestors should be added in time order, with the oldest first. The id of the added ancestor is returned.
Todo
Add documentation for the data attributes in read-mode.
Running inference#
- tsinfer.infer(variant_data, *, recombination_rate=None, mismatch_ratio=None, path_compression=True, exclude_positions=None, post_process=None, num_threads=0)[source]#
Runs the full inference pipeline on the specified
VariantData
instance and returns the inferredtskit.TreeSequence
. See matching ancestors & samples in the documentation for details ofrecombination_rate
,mismatch_ratio
andpath_compression
.Note
For finer grained control over inference, for example to set mismatch ratios when matching ancestors as well as when matching samples, run
tsinfer.generate_ancestors()
,tsinfer.match_ancestors()
andtsinfer.match_samples()
separately.- Parameters:
variant_data (VariantData) – The input
VariantData
instance representing the observed data that we wish to make inferences from.recombination_rate (float, msprime.RateMap) – Either a floating point value giving a constant rate \(\rho\) per unit length of genome, or an
msprime.RateMap
object. This is used to calculate the probability of recombination between adjacent sites in the match_samples stage. IfNone
, all matching conflicts are resolved by recombination and all inference sites will have a single mutation (equivalent to mismatch_ratio near zero).mismatch_ratio (float) – The probability of a mismatch relative to the median probability of recombination between adjacent sites: can only be used if a recombination rate has been set (default:
None
treated as 1 ifrecombination_rate
is set). This is only applied in the match_samples stage.path_compression (bool) – Whether to merge edges that share identical paths (essentially taking advantage of shared recombination breakpoints).
post_process (bool) – Whether to run the
post_process()
method on the the tree sequence which, among other things, removes ancestral material that does not end up in the current samples (if not specified, defaults toTrue
)exclude_positions (array_like) – A list of site positions to exclude for full inference. Sites with these positions will not be used to generate ancestors, and not used during the copying process. Any such sites that exist in the sample data file will be included in the trees after the main inference process using parsimony. The list does not need to be in to be in any particular order, and can include site positions that are not present in the sample data file.
num_threads (int) – The number of worker threads to use in parallelised sections of the algorithm. If <= 0, do not spawn any threads and use simpler sequential algorithms (default).
simplify (bool) – When post_processing, only simplify the tree sequence. deprecated but retained for backwards compatibility (default:
None
).
- Returns:
The
tskit.TreeSequence
object inferred from the input sample data.- Return type:
- tsinfer.generate_ancestors(variant_data, *, path=None, exclude_positions=None, num_threads=0, genotype_encoding=None, mmap_temp_dir=None, **kwargs)[source]#
Runs the ancestor generation algorithm on the specified
VariantData
instance and returns the resultingAncestorData
instance. If you wish to store generated ancestors persistently on file you must pass thepath
keyword argument to this function. For example,ancestor_data = tsinfer.generate_ancestors(variant_data, path="mydata.ancestors")
Other keyword arguments are passed to the
AncestorData
constructor, which may be used to control the storage properties of the generated file.Ancestor generation involves loading the entire genotype matrix into memory, by default using one byte per haploid genotype, which can be prohibitively large when working with sample sizes of 100,000 or more. There are two options to help mitigate memory usage. The
genotype_encoding
parameter allows the user to specify a more compact encoding scheme, which reduces storage space for datasets with small numbers of alleles. Currently, theGenotypeEncoding.ONE_BIT
encoding is supported, which provides 8-fold compression of biallelic, non-missing data. An error is raised if an encoding that does not support the range of values present in a given dataset is provided.The second option for reducing the RAM footprint of this function is to use the
mmap_temp_dir
parameter. This allows the genotype data to be cached on file, transparently using the operating system’s virtual memory subsystem to swap in and out the data. This can work well if the encoded genotype matrix almost fits in RAM and fast local storage is available. However, if the size of the encoded genotype matrix is more than, say, twice the available RAM it is unlikely that this function will complete in a reasonable time-frame. A temporary file is created in the specifiedmmap_temp_dir
, which is automatically deleted when the function completes.Warning
The
mmap_temp_dir
option is a silent no-op on Windows!- Parameters:
variant_data (VariantData) – The
VariantData
instance that we are generating putative ancestors from.path (str) – The path of the file to store the ancestor data. If None, the information is stored in memory and not persistent.
exclude_positions (array_like) – A list of site positions to exclude for full inference. Sites with these positions will not be used to generate ancestors, and not used during the copying process. The list does not need be in any particular order.
num_threads (int) – The number of worker threads to use. If < 1, use a simpler synchronous algorithm.
genotype_encoding (int) – The encoding to use for genotype data internally when generating ancestors. See the
GenotypeEncoding
class for the available options. Defaults to one-byte per genotype.mmap_temp_dir (str) – The directory within which to create the temporary backing file when using mmaped memory for bulk genotype storage. If None (the default) allocate memory directly using the standard mechanism. This is an advanced option, usually only relevant when working with very large datasets (see above for more information).
- Returns:
The inferred ancestors stored in an
AncestorData
instance.- Return type:
- class tsinfer.GenotypeEncoding(value)[source]#
The encoding scheme used to store genotypes.
- EIGHT_BIT = 0#
The default approach of using one-byte per genotype. Supports up to 127 alleles and missing data.
- ONE_BIT = 1#
Encode binary genotype data using a single bit.
- tsinfer.match_ancestors(variant_data, ancestor_data, *, recombination_rate=None, mismatch_ratio=None, path_compression=True, num_threads=0)[source]#
Run the ancestor matching algorithm on the specified
VariantData
andAncestorData
instances, returning the resultingtskit.TreeSequence
representing the complete ancestry of the putative ancestors. See matching ancestors & samples in the documentation for details ofrecombination_rate
,mismatch_ratio
andpath_compression
.- Parameters:
variant_data (VariantData) – The
VariantData
instance representing the input data.ancestor_data (AncestorData) – The
AncestorData
instance representing the set of ancestral haplotypes for which we are finding a history.recombination_rate (float, msprime.RateMap) – Either a floating point value giving a constant rate \(\rho\) per unit length of genome, or an
msprime.RateMap
object. This is used to calculate the probability of recombination between adjacent sites. IfNone
, all matching conflicts are resolved by recombination and all inference sites will have a single mutation (equivalent to mismatch_ratio near zero)mismatch_ratio (float) – The probability of a mismatch relative to the median probability of recombination between adjacent sites: can only be used if a recombination rate has been set (default:
None
treated as 1 ifrecombination_rate
is set).path_compression (bool) – Whether to merge edges that share identical paths (essentially taking advantage of shared recombination breakpoints).
num_threads (int) – The number of match worker threads to use. If this is <= 0 then a simpler sequential algorithm is used (default).
- Returns:
The ancestors tree sequence representing the inferred history of the set of ancestors.
- Return type:
- tsinfer.match_samples(variant_data, ancestors_ts, *, recombination_rate=None, mismatch_ratio=None, path_compression=True, post_process=None, indexes=None, force_sample_times=False, num_threads=0)[source]#
Runs the sample matching algorithm on the specified
VariantData
instance and ancestors tree sequence, returning the finaltskit.TreeSequence
instance containing the full inferred history for all samples and sites. See matching ancestors & samples in the documentation for details ofrecombination_rate
,mismatch_ratio
andpath_compression
.- Parameters:
variant_data (VariantData) – The
VariantData
instance representing the input data.ancestors_ts (tskit.TreeSequence) – The
tskit.TreeSequence
instance representing the inferred history among ancestral ancestral haplotypes.recombination_rate (float, msprime.RateMap) – Either a floating point value giving a constant rate \(\rho\) per unit length of genome, or an
msprime.RateMap
object. This is used to calculate the probability of recombination between adjacent sites. IfNone
, all matching conflicts are resolved by recombination and all inference sites will have a single mutation (equivalent to mismatch_ratio near zero)mismatch_ratio (float) – The probability of a mismatch relative to the median probability of recombination between adjacent sites: can only be used if a recombination rate has been set (default:
None
treated as 1 ifrecombination_rate
is set).path_compression (bool) – Whether to merge edges that share identical paths (essentially taking advantage of shared recombination breakpoints).
indexes (array_like) – An array of indexes into the variant_data file of the samples to match (in increasing order) or None for all samples.
post_process (bool) – Whether to run the
post_process()
method on the the tree sequence which, among other things, removes ancestral material that does not end up in the current samples (if not specified, defaults toTrue
)force_sample_times (bool) – After matching, should an attempt be made to adjust the time of “historical samples” (those associated with an individual having a non-zero time) such that the sample nodes in the tree sequence appear at the time of the individual with which they are associated.
num_threads (int) – The number of match worker threads to use. If this is <= 0 then a simpler sequential algorithm is used (default).
simplify (bool) – Treated as an alias for
post_process
, deprecated but currently retained for backwards compatibility if set toFalse
.overlay_non_inference_sites (bool) – If True, sites that were included in the selected sites, but were not used for inference, will be added to the tree sequence by mapping their mutations over the inferred topology. Defaults to True.
- Returns:
The tree sequence representing the inferred history of the sample.
- Return type:
- tsinfer.augment_ancestors(variant_data, ancestors_ts, indexes, *, recombination_rate=None, mismatch_ratio=None, path_compression=True, num_threads=0)[source]#
Runs the sample matching algorithm on the specified
VariantData
instance and ancestors tree sequence, for the specified subset of sample indexes, returning thetskit.TreeSequence
instance including these samples. This tree sequence can then be used as an ancestors tree sequence for subsequent matching against all samples. See matching ancestors & samples in the documentation for details ofrecombination_rate
,mismatch_ratio
andpath_compression
.- Parameters:
variant_data (VariantData) – The
VariantData
instance representing the input data.ancestors_ts (tskit.TreeSequence) – The
tskit.TreeSequence
instance representing the inferred history among ancestral ancestral haplotypes.indexes (array) – The sample indexes to insert into the ancestors tree sequence, in increasing order.
recombination_rate (float, msprime.RateMap) – Either a floating point value giving a constant rate \(\rho\) per unit length of genome, or an
msprime.RateMap
object. This is used to calculate the probability of recombination between adjacent sites. IfNone
, all matching conflicts are resolved by recombination and all inference sites will have a single mutation (equivalent to mismatch_ratio near zero)mismatch_ratio (float) – The probability of a mismatch relative to the median probability of recombination between adjacent sites: can only be used if a recombination rate has been set (default:
None
treated as 1 ifrecombination_rate
is set).path_compression (bool) – Whether to merge edges that share identical paths (essentially taking advantage of shared recombination breakpoints).
num_threads (int) – The number of match worker threads to use. If this is <= 0 then a simpler sequential algorithm is used (default).
- Returns:
The specified ancestors tree sequence augmented with copying paths for the specified sample.
- Return type:
- tsinfer.post_process(ts, *, split_ultimate=None, erase_flanks=None)[source]#
Post-process a tsinferred tree sequence into a more conventional form. This is the function run by default on the final tree sequence output by
match_samples()
. It involves the following 4 steps:If the oldest node is connected to a single child via an edge that spans the entire tree sequence, this oldest node is removed, so that its child becomes the new root (this step is undertaken to remove the “virtual-root-like node” which is added to ancestor tree sequences to enable matching).
If the oldest node is removed in the first step and the new root spans the entire genome, it is treated as the “ultimate ancestor” and (unless
split_ultimate
isFalse
) the node is split into multiple coexisiting nodes with the splits occurring whenever the children of the ultimate ancestor change. The rationale is that tsinfer creates a single ancestral haplotype with all inference sites in the ancestral state: this is, however, unlikely to represent a single ancestor in the past. If the tree sequence is then dated, the fact that ultimate ancestor is split into separate nodes allows these nodes to be dated to different times.Often, extensive regions of genome exist before the first defined site and after the last defined site. Since no data exists in these sections of the genome, post processing by default erases the inferred topology in these regions. However, if
erase_flanks
is False, the flanking regions at the start and end will be assigned the same topology as inferred at the first and last site respectively.The sample nodes are reordered such that they are the first nodes listed in the node table, removing tree nodes and edges that are not on a path between the root and any of the samples (by applying the
simplify()
method withkeep_unary
set to True butfilter_sites
,filter_populations
andfilter_individuals
set to False).
- Parameters:
split_ultimate (bool) – If
True
(default) and the oldest node is the only parent to a single “ultimate ancestor” node, attempt to split this node into many separate nodes (see above). IfFalse
do not attempt to identify or split an ultimate ancestor node.erase_flanks (bool) – If
True
(default), keep only the inferred topology between the first and last sites. IfFalse
, output regions of topology inferred before the first site and after the last site.
- Returns:
The post-processed tree sequence.
- Return type:
Batched inference#
- tsinfer.match_ancestors_batch_init(work_dir, variant_data_path, ancestral_state, ancestor_data_path, min_work_per_job, *, max_num_partitions=None, sample_mask=None, site_mask=None, sequence_length=None, recombination_rate=None, mismatch_ratio=None, path_compression=True, recombination=None, mismatch=None, precision=None, engine='C', extended_checks=False, time_units=None, record_provenance=True)[source]#
match_ancestors_batch_init(work_dir, variant_data_path, ancestral_state, ancestor_data_path, min_work_per_job, *, max_num_partitions=None, sample_mask=None, site_mask=None, recombination_rate=None, mismatch_ratio=None, path_compression=True)
Initialise a batched ancestor matching job. This function is used to prepare a working directory for running a batched ancestor matching job. The job is split into groups of ancestors, with each group further split into partitions of ancestors if necessary. work_dir is created and details are written to metadata.json in work_dir. The job can then be run using
match_ancestors_batch_groups()
andmatch_ancestors_batch_group_partition()
then finallymatch_ancestors_batch_group_finalise()
. See large scale inference for more details about how these methods work together. Seematch_ancestors()
for details on ancestor matching.- Parameters:
work_dir (str) – The directory in which to store the working files.
variant_data_path (str) –
The input dataset in VCF Zarr format. Path to the Zarr dataset saved on disk. See
VariantData
.ancestral_state (Union(array, str)) – A numpy array of strings specifying the ancestral states (alleles) used in inference. This must be the same length as the number of unmasked sites in the dataset. Alternatively, a single string can be provided, giving the name of an array in the input dataset which contains the ancestral states. Unknown ancestral states can be specified using “N”. Any ancestral states which do not match any of the known alleles at that site, will be tallied, and a warning issued summarizing the unknown ancestral states.
ancestor_data_path (str) – The path to the file containing the ancestors generated by
generate_ancestors()
.min_work_per_job (int) – The minimum amount of work (as a count of genotypes) to allocate to a single parallel job. If the amount of work in a group of ancestors exceeds this level it will be broken up into parallel partitions, subject to the constraint of max_num_partitions.
max_num_partitions (int) – The maximum number of partitions to split a group of ancestors into. Useful for limiting the number of jobs in a workflow to avoid job overhead. Defaults to 1000.
sample_mask (Union(array, str)) – A numpy array of booleans specifying which samples to mask out (exclude) from the dataset. Alternatively, a string can be provided, giving the name of an array in the input dataset which contains the sample mask. If
None
(default), all samples are included.site_mask (Union(array, str)) – A numpy array of booleans specifying which sites to mask out (exclude) from the dataset. Alternatively, a string can be provided, giving the name of an array in the input dataset which contains the site mask. If
None
(default), all sites are included.sequence_length (int) – An integer specifying the resulting sequence_length attribute of the output tree sequence. If not specified the contig_length attribute from the undelying zarr store for the contig of the selected variants. is used. If that is not present then the maximum position plus one of the used variants is used.
recombination_rate (float, msprime.RateMap) – Either a floating point value giving a constant rate \(\rho\) per unit length of genome, or an
msprime.RateMap
object. This is used to calculate the probability of recombination between adjacent sites. IfNone
, all matching conflicts are resolved by recombination and all inference sites will have a single mutation (equivalent to mismatch_ratio near zero)mismatch_ratio (float) – The probability of a mismatch relative to the median probability of recombination between adjacent sites: can only be used if a recombination rate has been set (default:
None
treated as 1 ifrecombination_rate
is set).path_compression (bool) – Whether to merge edges that share identical paths (essentially taking advantage of shared recombination breakpoints).
- Returns:
A dictionary of the job metadata, as written to metadata.json in work_dir. ancestor_grouping in this dict contains the grouping of ancestors into groups and should be used to guide calling
match_ancestors_batch_groups()
andmatch_ancestors_batch_group_partition()
.- Return type:
- tsinfer.match_ancestors_batch_groups(work_dir, group_index_start, group_index_end, num_threads=0)[source]#
match_ancestors_batch_groups(work_dir, group_index_start, group_index_end, num_threads=0)
Match a set of ancestor groups from group_index_start`(inclusive) to `group_index_end`(exclusive) in a batched ancestor matching job. See :ref:`large scale inference<sec_large_scale> for more details.
A tree sequence file for group_index_start - 1 must exist in work_dir, unless group_index_start is 0. After matching the tree sequence for group_index_end - 1 is written to work_dir.
- Parameters:
work_dir (str) – The working directory for the batch job, as written by
match_ancestors_batch_init()
.group_index_start (int) – The first group index to match.
group_index_end (int) – The group index to stop matching at.
num_threads (int) – The number of worker threads to use. If this is <= 1 then match sequentially.
- Returns:
The tree sequence representing the inferred ancestors for the last group matched
- Return type:
- tsinfer.match_ancestors_batch_group_partition(work_dir, group_index, partition_index)[source]#
Match a single partition of ancestors from a group in a batched ancestor matching job. See large scale inference for more details. The tree sequence for the group before must exist in work_dir. After matching the results for the partition are written to work_dir. Once all partitions for a group have been matched, the group can be finalised using
match_ancestors_batch_group_finalise()
. The number of partitions in a group is recorded in metadata.json in the work dir under the ancestor_grouping key. This method uses a single thread.- Parameters:
work_dir (str) – The working directory for the batch job, as written by
match_ancestors_batch_init()
.group_index (int) – The group index that contains the partition to match.
partition_index (int) – The partition index to match. Must be less than the number of partitions in the batch job metadata for this group.
- tsinfer.match_ancestors_batch_group_finalise(work_dir, group_index)[source]#
Finalise a group of partitioned ancestors in a batched ancestor matching job. See large scale inference for more details. The tree sequence for the group before must exist in work_dir, along with the results for all partitions in this group. Writes the tree sequence for the group to work_dir.
- Parameters:
work_dir (str) – The working directory for the batch job, as written by
match_ancestors_batch_init()
.group_index (int) – The group index to finalise.
- Returns:
The tree sequence representing the inferred ancestors for the group
- Return type:
- tsinfer.match_ancestors_batch_finalise(work_dir)[source]#
Finalise a batched ancestor matching job. This method should be called after all groups have been matched, either by
match_ancestors_batch_groups()
ormatch_ancestors_batch_group_finalise()
. Returns the final ancestors tree sequence for the batch job. work_dir is retained and not deleted.- Parameters:
work_dir (str) – The working directory for the batch job, as written by
match_ancestors_batch_init()
.- Returns:
The tree sequence representing the inferred ancestors for the batch job
- Return type:
- tsinfer.match_samples_batch_init(work_dir, variant_data_path, ancestral_state, ancestor_ts_path, min_work_per_job, *, sample_mask=None, site_mask=None, sites_time=None, individuals_time=None, individuals_location=None, individuals_population=None, individuals_flags=None, sequence_length=None, recombination_rate=None, mismatch_ratio=None, path_compression=True, indexes=None, post_process=None, force_sample_times=False, overlay_non_inference_sites=None, recombination=None, mismatch=None, precision=None, extended_checks=False, engine='C', record_provenance=True)[source]#
match_samples_batch_init(work_dir, variant_data_path, ancestral_state, ancestor_ts_path, min_work_per_job, *, sample_mask=None, site_mask=None, recombination_rate=None, mismatch_ratio=None, path_compression=True, indexes=None, post_process=None, force_sample_times=False)
Initialise a batched sample matching job. Creates work_dir and writes job details to metadata.json. The job can then be run using parallel calls to
match_samples_batch_partition()
and once those are complete finallymatch_samples_batch_finalise()
.The num_partitions key in the metadata dict contains the number of partitions that need to be processed.
- Parameters:
work_dir (str) – The directory in which to store the working files.
variant_data_path (str) –
The input dataset in VCF Zarr format. Path to the Zarr dataset saved on disk. See
VariantData
.ancestral_state (Union(array, str)) – A numpy array of strings specifying the ancestral states (alleles) used in inference. This must be the same length as the number of unmasked sites in the dataset. Alternatively, a single string can be provided, giving the name of an array in the input dataset which contains the ancestral states. Unknown ancestral states can be specified using “N”. Any ancestral states which do not match any of the known alleles at that site, will be tallied, and a warning issued summarizing the unknown ancestral states.
ancestor_ts_path (str) – The path to the tree sequence file containing the ancestors generated by
match_ancestors_batch_finalise()
, ormatch_ancestors()
.min_work_per_job (int) – The minimum amount of work (as a count of genotypes) to allocate to a single parallel job. If the amount of work in a group of samples exceeds this level it will be broken up into parallel partitions, subject to the constraint of max_num_partitions.
sample_mask (Union(array, str)) – A numpy array of booleans specifying which samples to mask out (exclude) from the dataset. Alternatively, a string can be provided, giving the name of an array in the input dataset which contains the sample mask. If
None
(default), all samples are included.site_mask (Union(array, str)) – A numpy array of booleans specifying which sites to mask out (exclude) from the dataset. Alternatively, a string can be provided, giving the name of an array in the input dataset which contains the site mask. If
None
(default), all sites are included.sites_time (Union(array, str)) – A numpy array of floats specifying the relative time of occurrence of the mutation to the derived state at each site. This must be of the same length as the number of unmasked sites. Alternatively, a string can be provided, giving the name of an array in the input dataset which contains the site times. If
None
(default), the frequency of the derived allele is used as a proxy for the time of occurrence: this is usually a reasonable approximation to the relative order of ancestors used for inference. Time values are ignored for sites not used in inference, such as singletons, sites with more than two alleles, or sites with an unknown ancestral state.individuals_time (Union(array, str)) – A numpy array of floats specifying the time of each individual in the dataset. This must be the same length as the number of unmasked individuals. Alternatively, a string can be provided, giving the name of an array in the input dataset which contains the individual times. If
None
(default), individuals are assumed to havetskit.UNKNOWN_TIME
.individuals_location (Union(array, str)) – A numpy array specifying the location of each individual in the dataset. This must be the same length as the number of unmasked individuals. Alternatively, a string can be provided, giving the name of an array in the input dataset which contains the individual locations. If
None
(default), individuals are assumed to have empty location arrays.individuals_population (Union(array, str)) – A numpy array of integers specifying the population of each individual in the dataset. This must be the same length as the number of unmasked individuals. Alternatively, a string can be provided, giving the name of an array in the input dataset which contains the individual populations. If
None
(default), individuals are assumed to havetskit.NULL
as their population.individuals_flags (Union(array, str)) – A numpy array of integers specifying the flags of each individual in the dataset. This must be the same length as the number of unmasked individuals. Alternatively, a string can be provided, giving the name of an array in the input dataset which contains the individual flags. If
None
(default), individuals are assumed to have flags set to 0.sequence_length (int) – An integer specifying the resulting sequence_length attribute of the output tree sequence. If not specified the contig_length attribute from the undelying zarr store for the contig of the selected variants. is used. If that is not present then the maximum position plus one of the used variants is used.
recombination_rate (float, msprime.RateMap) – Either a floating point value giving a constant rate \(\rho\) per unit length of genome, or an
msprime.RateMap
object. This is used to calculate the probability of recombination between adjacent sites. IfNone
, all matching conflicts are resolved by recombination and all inference sites will have a single mutation (equivalent to mismatch_ratio near zero)mismatch_ratio (float) – The probability of a mismatch relative to the median probability of recombination between adjacent sites: can only be used if a recombination rate has been set (default:
None
treated as 1 ifrecombination_rate
is set).path_compression (bool) – Whether to merge edges that share identical paths (essentially taking advantage of shared recombination breakpoints).
indexes (arraylike) – The sample indexes to match. If
None
(default), all samples are matched.post_process (bool) – Whether to run the
post_process()
method on the the tree sequence which, among other things, removes ancestral material that does not end up in the current samples (if not specified, defaults toTrue
)force_sample_times (bool) – After matching, should an attempt be made to adjust the time of “historical samples” (those associated with an individual having a non-zero time) such that the sample nodes in the tree sequence appear at the time of the individual with which they are associated.
overlay_non_inference_sites (bool) – If True, sites that were included in the selected sites, but were not used for inference, will be added to the tree sequence by mapping their mutations over the inferred topology. Defaults to True.
- Returns:
A dictionary of the job metadata, as written to metadata.json in work_dir.
- tsinfer.match_samples_batch_partition(work_dir, partition_index)[source]#
Match a single partition of samples in a batched sample matching job. See large scale inference for more details. Match data for the partition is written to work_dir. Uses a single thread to perform matching.
- Parameters:
work_dir (str) – The working directory for the batch job, as written by
match_samples_batch_init()
.partition_index (int) – The partition index to match. Must be less than the number of partitions in the batch job metadata key num_partitions.
- tsinfer.match_samples_batch_finalise(work_dir)[source]#
Finalise a batched sample matching job. This method should be called after all partitions have been matched by
match_samples_batch_partition()
. Returns the final tree sequence for the batch job. work_dir is retained and not deleted.- Parameters:
work_dir (str) – The working directory for the batch job, as written by
match_samples_batch_init()
.- Returns:
The tree sequence representing the inferred history of the samples.
- Return type: