API Documentation#

Variant data#

class tsinfer.VariantData(path_or_zarr, ancestral_state, *, sample_mask=None, site_mask=None, sites_time=None, individuals_time=None, individuals_location=None, individuals_population=None, individuals_flags=None, sequence_length=None)[source]#

Class representing input variant data used for inference. This is mostly a thin wrapper for a Zarr dataset storing information in the VCF Zarr (.vcz) format, plus information specifying the ancestral allele and (optional) data masks. It then provides various derived properties and methods for accessing the data in a form suitable for inference.

Note

In the VariantData object, “samples” refer to the individuals in the dataset, each of which can be of arbitrary ploidy. This is in contrast to tskit, in which each haploid genome is treated as a separate “sample”. For example in a diploid dataset, the inferred tree sequence returned at the end of the inference process will have inferred_ts.num_samples equal to double the number returned by VariantData.num_samples.

Parameters:
  • path_or_zarr (Union(str, zarr.Group)) – The input dataset in VCF Zarr format. This can either be a path to the Zarr dataset saved on disk, or the Zarr object itself.

  • ancestral_state (Union(array, str)) – A numpy array of strings specifying the ancestral states (alleles) used in inference. This must be the same length as the number of unmasked sites in the dataset. Alternatively, a single string can be provided, giving the name of an array in the input dataset which contains the ancestral states. Unknown ancestral states can be specified using “N”. Any ancestral states which do not match any of the known alleles at that site will be tallied, and a warning issued summarizing the unknown ancestral states. Note that allelic states are case-sensitive in tsinfer.

  • sample_mask (Union(array, str)) – A numpy array of booleans specifying which samples to mask out (exclude) from the dataset. Alternatively, a string can be provided, giving the name of an array in the input dataset which contains the sample mask. If None (default), all samples are included.

  • site_mask (Union(array, str)) – A numpy array of booleans specifying which sites to mask out (exclude) from the dataset. Alternatively, a string can be provided, giving the name of an array in the input dataset which contains the site mask. If None (default), all sites are included.

  • sites_time (Union(array, str)) – A numpy array of floats specifying the relative time of occurrence of the mutation to the derived state at each site. This must be of the same length as the number of unmasked sites. Alternatively, a string can be provided, giving the name of an array in the input dataset which contains the site times. If None (default), the frequency of the derived allele is used as a proxy for the time of occurrence: this is usually a reasonable approximation to the relative order of ancestors used for inference. Time values are ignored for sites not used in inference, such as singletons, sites with more than two alleles, or sites with an unknown ancestral state.

  • individuals_time (Union(array, str)) – A numpy array of floats specifying the time of each individual in the dataset. This must be the same length as the number of unmasked individuals. Alternatively, a string can be provided, giving the name of an array in the input dataset which contains the individual times. If None (default), individuals are assumed to have tskit.UNKNOWN_TIME.

  • individuals_location (Union(array, str)) – A numpy array specifying the location of each individual in the dataset. This must be the same length as the number of unmasked individuals. Alternatively, a string can be provided, giving the name of an array in the input dataset which contains the individual locations. If None (default), individuals are assumed to have empty location arrays.

  • individuals_population (Union(array, str)) – A numpy array of integers specifying the population of each individual in the dataset. This must be the same length as the number of unmasked individuals. Alternatively, a string can be provided, giving the name of an array in the input dataset which contains the individual populations. If None (default), individuals are assumed to have tskit.NULL as their population.

  • individuals_flags (Union(array, str)) – A numpy array of integers specifying the flags of each individual in the dataset. This must be the same length as the number of unmasked individuals. Alternatively, a string can be provided, giving the name of an array in the input dataset which contains the individual flags. If None (default), individuals are assumed to have flags set to 0.

  • sequence_length (int) – An integer specifying the resulting sequence_length attribute of the output tree sequence. If not specified, the contig_length attribute from the underlying Zarr store for the contig of the selected variants is used. If that is not present, the maximum position plus one of the used variants is used.
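For example, a minimal construction sketch (the “mydata.vcz” path and the “variant_ancestral_allele” and “unrelated_sample_mask” array names are hypothetical; substitute the names used in your own dataset):

import tsinfer

# Ancestral states are read from a named array in the dataset, and a
# boolean array in the dataset is used to exclude some samples.
vdata = tsinfer.VariantData(
    "mydata.vcz",
    ancestral_state="variant_ancestral_allele",
    sample_mask="unrelated_sample_mask",
)
print(vdata.num_samples)  # counts individuals, not haploid genomes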

tsinfer.add_ancestral_state_array(zarr_group, fasta_string, *, array_name='ancestral_state')[source]#

Add an ancestral state array to a zarr group from a string of nucleotides representing the ancestral sequence.

Parameters:
  • zarr_group (zarr.Group) – A zarr group to add the ancestral state array to. This should be a VCF Zarr format group containing a “variant_position” array.

  • fasta_string (str) – A string containing the ancestral sequence, from e.g. FASTA.

  • array_name (str) – The name of the array to create in the zarr group. Default is “ancestral_state”.

Returns:

The newly inserted ancestral state array.

Return type:

zarr.Array
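For example, a sketch of adding the array from a single-record FASTA file (the “ancestral.fa” and “mydata.vcz” paths are hypothetical):

import zarr
import tsinfer

# Read the ancestral sequence from a single-record FASTA file.
with open("ancestral.fa") as f:
    fasta_string = "".join(
        line.strip() for line in f if not line.startswith(">"))

# Open the VCF Zarr group for editing and add the array.
root = zarr.open("mydata.vcz", mode="r+")
tsinfer.add_ancestral_state_array(root, fasta_string)

# The new array can then be referenced by name:
vdata = tsinfer.VariantData("mydata.vcz", ancestral_state="ancestral_state")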

Ancestor data#

tsinfer.load(path)[source]#

Loads a tsinfer AncestorData file from the specified path. If the file format is not recognised, a FileFormatError will be raised.

Parameters:

path (str) – The path of the file we wish to load.

Returns:

The corresponding AncestorData instance opened in read only mode.

Return type:

AncestorData

Raises:

FileFormatError if the file cannot be read.
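For example, loading a file previously written via generate_ancestors(path=...) (the path is hypothetical):

ancestors = tsinfer.load("mydata.ancestors")
print(ancestors.num_sites, ancestors.sequence_length)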

class tsinfer.AncestorData(position, sequence_length, *, path=None, num_flush_threads=0, compressor=None, chunk_size=1024, chunk_size_sites=16384, max_file_size=None)[source]#

Class representing the stored ancestor data produced by generate_ancestors(). See the ancestor data file format specifications for details on the structure of this file.

Note

If a path is specified, the max_file_size option puts an upper limit on the possible size of the created file. On non-Windows systems, space for this file is not immediately allocated but just “reserved” using sparse file systems. However, on Windows systems the file is allocated immediately, so max_file_size takes a smaller default value, to avoid allocating very large files for no reason. Users who wish to run large inferences on Windows may therefore need to explicitly set an appropriate max_file_size. Note that the max_file_size is only used while the file is being built: once the file has been finalised, it is shrunk to its minimum size.

Parameters:
  • position (arraylike) – Integer array of the site positions of the ancestors. All values should be >0 and the array should be monotonically increasing.

  • sequence_length (float) – Total length of the sequence, site positions must be less than this value.

  • path (str) – The path of the file to store the ancestor data. If None, the information is stored in memory and not persistent.

  • num_flush_threads (int) – The number of background threads to use for compressing data and flushing to disc. If <= 0, do not spawn any threads but use a synchronous algorithm instead. Default=0.

  • compressor (numcodecs.abc.Codec) – A numcodecs.abc.Codec instance to use for compressing data. Any codec may be used, but problems may occur with very large datasets on certain codecs as they cannot compress buffers >2GB. If None, do not use any compression. Default: numcodecs.zstd.Zstd.

  • chunk_size (int) – The chunk size used for zarr arrays in the sample dimension. This affects compression level and algorithm performance. Default=1024.

  • chunk_size_sites (int) – The chunk size used for the genotype zarr arrays in the sites dimension. This affects compression level and algorithm performance. Default=16384.

  • max_file_size (int) – If a file is being used to store this data, set a maximum size in bytes for the stored file. If None, the default value of 1GiB is used on Windows and 1TiB on other platforms (see above for details).

data_equal(other)[source]#

Returns True if all the data attributes of this input file and the specified input file are equal. This compares every attribute.

property sequence_length#

Returns the sequence length.

property num_sites#

The number of inference sites used to generate the ancestors.

property sites_position#

The positions of the inference sites used to generate the ancestors.

add_provenance(timestamp, record)#

Adds a new provenance record with the specified timestamp and record. Timestamps should be ISO 8601 formatted, and the record should be a JSON-encodable object.

arrays()#

Returns a list of all the zarr arrays in this DataContainer.

clear_provenances()#

Clear all provenances in this instance.

close()#

Close this DataContainer. Any read or write operations attempted after calling this will fail.

copy(path=None, max_file_size=None)#

Returns a copy of this DataContainer opened in ‘edit’ mode. If path is specified, this must not be equal to the path of the current data container.

property file_size#

Returns the size of the underlying file, or -1 if we do not have a file associated.

property info#

Returns a string containing the zarr info for each array.

provenances()#

Returns an iterator over the (timestamp, record) pairs representing the provenances for this data container.

record_provenance(command=None, resources=None, **kwargs)#

Records the provenance information for this file using the tskit provenances schema.

property ancestors_length#

Returns the length of ancestors in physical coordinates.

insert_proxy_samples(variant_data, *, sample_ids=None, epsilon=None, allow_mutation=False, **kwargs)[source]#

Take a set of samples from a VariantData instance and create additional “proxy sample ancestors” from them, returning a new AncestorData instance including both the current ancestors and the additional ancestors at the appropriate time points.

A proxy sample ancestor is an ancestor based upon a known sample. At sites used in the full inference process, the haplotype of this ancestor is identical to that of the sample on which it is based. The time of the ancestor is taken to be a fraction epsilon older than the sample on which it is based.

A common use of this function is to provide ancestral nodes for anchoring historical samples at the correct time when matching them into a tree sequence during the tsinfer.match_samples() stage of inference. For this reason, by default, the samples chosen from variant_data are those associated with historical (i.e. non-contemporary) individuals. This can be altered by using the sample_ids parameter.

Note

The proxy sample ancestors inserted here will correspond to extra nodes in the inferred tree sequence. At sites which are not used in the full inference process (e.g. sites unique to a single historical sample), these proxy sample ancestor nodes may have a different genotype from their corresponding sample.

Parameters:
  • variant_data (VariantData) – The VariantData instance from which to select the samples used to create extra ancestors.

  • sample_ids (list(int)) – A list of sample ids in the variant_data instance that will be selected to create the extra ancestors. If None (default) select all the historical samples, i.e. those associated with an Individual whose time is greater than zero. The order of ids is ignored, as are duplicate ids.

  • epsilon (list(float)) – A list of small time increments determining how much older each proxy sample ancestor is than the corresponding sample listed in sample_ids. A single value is also allowed, in which case it is used as the time increment for all selected proxy sample ancestors. If None (default) find \({\delta}t\), the smallest time difference between the sample times and the next oldest ancestor in the current AncestorData instance, setting epsilon = \({\delta}t / 100\) (or, if all selected samples are at least as old as the oldest ancestor, take \({\delta}t\) to be the smallest non-zero time difference between existing ancestors).

  • allow_mutation (bool) – If False (the default), any site in a proxy sample ancestor that has a derived allele must have a pre-existing mutation in an older (non-proxy) ancestor, otherwise an error is raised. Alternatively, if allow_mutation is True, proxy ancestors can contain a de-novo mutation at a site that also has a mutation elsewhere (i.e. breaking the infinite sites assumption), allowing them to possess derived alleles at sites where there are no pre-existing mutations in older ancestors.

  • **kwargs – Further arguments passed to the constructor when creating the new AncestorData instance which will be returned.

Returns:

A new AncestorData object.

Return type:

AncestorData
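A sketch of the typical historical-samples workflow this method supports, assuming vdata is a VariantData instance containing individuals with non-zero times:

ancestors = tsinfer.generate_ancestors(vdata)
# Insert proxy ancestors slightly older than each historical sample.
ancestors_with_proxies = ancestors.insert_proxy_samples(vdata)
anc_ts = tsinfer.match_ancestors(vdata, ancestors_with_proxies)
# force_sample_times places historical sample nodes at the times of
# their associated individuals.
ts = tsinfer.match_samples(vdata, anc_ts, force_sample_times=True)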

truncate_ancestors(lower_time_bound, upper_time_bound, length_multiplier=2, buffer_length=1000, **kwargs)[source]#

Truncates the length of ancestors above a given time and returns a new AncestorData instance.

Given a set of haplotypes H such that lower_time_bound <= h.time < upper_time_bound, we let max_len = length_multiplier * max(h.length for h in H). Then, we truncate all haplotypes containing at least one focal site where h.time >= upper_time_bound, ensuring these haplotypes extend no further than half of max_len to either side of the leftmost and rightmost focal sites of the ancestral haplotype. Note that ancestors above upper_time_bound may still be longer than max_len if the ancestor contains more than two focal sites.

This function should be used when tsinfer.generate_ancestors() generates old ancestors which are very long, as these can significantly slow down matching time. Older ancestors should generally be shorter than younger ancestors, so truncating the lengths of older ancestors has negligible effect on inference accuracy.

Note

Please ensure that the time values provided to lower_time_bound and upper_time_bound match the units used in the AncestorData file, i.e. if your ancestors do not have site times specified, upper_time_bound should be between 0 and 1.

Parameters:
  • lower_time_bound (float) – Defines the lower bound (inclusive) of the half open interval where we search for a truncation value.

  • upper_time_bound (float) – Defines the upper bound (exclusive) of the half open interval where we search for a truncation value. The truncation value is the length of the longest haplotype in this interval multiplied by length_multiplier. The length of ancestors as old or older than upper_time_bound will be truncated using this value.

  • length_multiplier (float) – A multiplier for the length of the longest ancestor in the half-open interval between lower_time_bound (inclusive) and upper_time_bound (exclusive), i.e. if the longest ancestor in the interval is 1 megabase, a length_multiplier of 2 creates a maximum length of 2 megabases.

  • buffer_length (int) – The number of changed ancestors to buffer before writing to disk.

  • **kwargs – Further arguments passed to the AncestorData.copy() when creating the new AncestorData instance which will be returned.

Returns:

A new AncestorData object.

Return type:

AncestorData
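For example, a sketch assuming frequency-based (uncalibrated) ancestor times, which lie between 0 and 1, and existing vdata and ancestors instances:

# Truncate ancestors as old or older than 0.6, capping their length at
# twice that of the longest ancestor in the [0.4, 0.6) interval.
truncated = ancestors.truncate_ancestors(0.4, 0.6, length_multiplier=2)
anc_ts = tsinfer.match_ancestors(vdata, truncated)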

add_ancestor(start, end, time, focal_sites, haplotype)[source]#

Adds an ancestor with the specified haplotype, with ancestral material over the interval [start:end], that is associated with the specified timepoint and has new mutations at the specified list of focal sites. Ancestors should be added in time order, with the oldest first. The id of the added ancestor is returned.

finalise()[source]#

Ensures that the state of the data is flushed and writes the provenance for the current operation. The specified ‘command’ is used to fill the corresponding entry in the provenance dictionary.

ancestor(id_)[source]#

Returns the ancestor with the specified ID.

Return type:

Ancestor

ancestors(indexes=None)[source]#

Returns an iterator over all the ancestors. If indexes is provided, it should be a sorted list of indexes giving a subset of ancestors to return. For efficiency, the indexes should be a numpy integer array.

Todo

  1. Add documentation for the data attributes in read-mode.

Running inference#

tsinfer.infer(variant_data, *, recombination_rate=None, mismatch_ratio=None, path_compression=True, exclude_positions=None, post_process=None, num_threads=0)[source]#

Runs the full inference pipeline on the specified VariantData instance and returns the inferred tskit.TreeSequence. See matching ancestors & samples in the documentation for details of recombination_rate, mismatch_ratio and path_compression.

Note

For finer grained control over inference, for example to set mismatch ratios when matching ancestors as well as when matching samples, run tsinfer.generate_ancestors(), tsinfer.match_ancestors() and tsinfer.match_samples() separately.

Parameters:
  • variant_data (VariantData) – The input VariantData instance representing the observed data that we wish to make inferences from.

  • recombination_rate (float, msprime.RateMap) – Either a floating point value giving a constant rate \(\rho\) per unit length of genome, or an msprime.RateMap object. This is used to calculate the probability of recombination between adjacent sites in the match_samples stage. If None, all matching conflicts are resolved by recombination and all inference sites will have a single mutation (equivalent to mismatch_ratio near zero).

  • mismatch_ratio (float) – The probability of a mismatch relative to the median probability of recombination between adjacent sites: can only be used if a recombination rate has been set (default: None treated as 1 if recombination_rate is set). This is only applied in the match_samples stage.

  • path_compression (bool) – Whether to merge edges that share identical paths (essentially taking advantage of shared recombination breakpoints).

  • post_process (bool) – Whether to run the post_process() method on the tree sequence which, among other things, removes ancestral material that does not end up in the current samples (if not specified, defaults to True).

  • exclude_positions (array_like) – A list of site positions to exclude for full inference. Sites with these positions will not be used to generate ancestors, and not used during the copying process. Any such sites that exist in the sample data file will be included in the trees after the main inference process using parsimony. The list does not need to be in any particular order, and can include site positions that are not present in the sample data file.

  • num_threads (int) – The number of worker threads to use in parallelised sections of the algorithm. If <= 0, do not spawn any threads and use simpler sequential algorithms (default).

  • simplify (bool) – When post-processing, only simplify the tree sequence. Deprecated but retained for backwards compatibility (default: None).

Returns:

The tskit.TreeSequence object inferred from the input sample data.

Return type:

tskit.TreeSequence
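For example, a minimal end-to-end sketch (the paths, array name and recombination rate are illustrative only):

import tsinfer

vdata = tsinfer.VariantData("mydata.vcz", ancestral_state="variant_ancestral_allele")
ts = tsinfer.infer(vdata, recombination_rate=1e-8, mismatch_ratio=1, num_threads=4)
ts.dump("inferred.trees")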

tsinfer.generate_ancestors(variant_data, *, path=None, exclude_positions=None, num_threads=0, genotype_encoding=None, mmap_temp_dir=None, **kwargs)[source]#

Runs the ancestor generation algorithm on the specified VariantData instance and returns the resulting AncestorData instance. If you wish to store generated ancestors persistently on file you must pass the path keyword argument to this function. For example,

ancestor_data = tsinfer.generate_ancestors(variant_data, path="mydata.ancestors")

Other keyword arguments are passed to the AncestorData constructor, which may be used to control the storage properties of the generated file.

Ancestor generation involves loading the entire genotype matrix into memory, by default using one byte per haploid genotype, which can be prohibitively large when working with sample sizes of 100,000 or more. There are two options to help mitigate memory usage. The genotype_encoding parameter allows the user to specify a more compact encoding scheme, which reduces storage space for datasets with small numbers of alleles. Currently, the GenotypeEncoding.ONE_BIT encoding is supported, which provides 8-fold compression of biallelic, non-missing data. An error is raised if an encoding that does not support the range of values present in a given dataset is provided.

The second option for reducing the RAM footprint of this function is to use the mmap_temp_dir parameter. This allows the genotype data to be cached on file, transparently using the operating system’s virtual memory subsystem to swap in and out the data. This can work well if the encoded genotype matrix almost fits in RAM and fast local storage is available. However, if the size of the encoded genotype matrix is more than, say, twice the available RAM it is unlikely that this function will complete in a reasonable time-frame. A temporary file is created in the specified mmap_temp_dir, which is automatically deleted when the function completes.

Warning

The mmap_temp_dir option is a silent no-op on Windows!

Parameters:
  • variant_data (VariantData) – The VariantData instance that we are generating putative ancestors from.

  • path (str) – The path of the file to store the ancestor data. If None, the information is stored in memory and not persistent.

  • exclude_positions (array_like) – A list of site positions to exclude for full inference. Sites with these positions will not be used to generate ancestors, and not used during the copying process. The list does not need to be in any particular order.

  • num_threads (int) – The number of worker threads to use. If < 1, use a simpler synchronous algorithm.

  • genotype_encoding (int) – The encoding to use for genotype data internally when generating ancestors. See the GenotypeEncoding class for the available options. Defaults to one byte per genotype.

  • mmap_temp_dir (str) – The directory within which to create the temporary backing file when using mmaped memory for bulk genotype storage. If None (the default) allocate memory directly using the standard mechanism. This is an advanced option, usually only relevant when working with very large datasets (see above for more information).

Returns:

The inferred ancestors stored in an AncestorData instance.

Return type:

AncestorData
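For example, a sketch combining both memory-saving options described above (the “/scratch” directory is hypothetical, and mmap_temp_dir is a no-op on Windows):

ancestor_data = tsinfer.generate_ancestors(
    variant_data,
    path="mydata.ancestors",
    genotype_encoding=tsinfer.GenotypeEncoding.ONE_BIT,  # 1 bit per genotype
    mmap_temp_dir="/scratch",  # fast local storage for the genotype cache
    num_threads=8,
)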

class tsinfer.GenotypeEncoding(value)[source]#

The encoding scheme used to store genotypes.

EIGHT_BIT = 0#

The default approach of using one byte per genotype. Supports up to 127 alleles and missing data.

ONE_BIT = 1#

Encode binary genotype data using a single bit.

tsinfer.match_ancestors(variant_data, ancestor_data, *, recombination_rate=None, mismatch_ratio=None, path_compression=True, num_threads=0)[source]#

Run the ancestor matching algorithm on the specified VariantData and AncestorData instances, returning the resulting tskit.TreeSequence representing the complete ancestry of the putative ancestors. See matching ancestors & samples in the documentation for details of recombination_rate, mismatch_ratio and path_compression.

Parameters:
  • variant_data (VariantData) – The VariantData instance representing the input data.

  • ancestor_data (AncestorData) – The AncestorData instance representing the set of ancestral haplotypes for which we are finding a history.

  • recombination_rate (float, msprime.RateMap) – Either a floating point value giving a constant rate \(\rho\) per unit length of genome, or an msprime.RateMap object. This is used to calculate the probability of recombination between adjacent sites. If None, all matching conflicts are resolved by recombination and all inference sites will have a single mutation (equivalent to mismatch_ratio near zero)

  • mismatch_ratio (float) – The probability of a mismatch relative to the median probability of recombination between adjacent sites: can only be used if a recombination rate has been set (default: None treated as 1 if recombination_rate is set).

  • path_compression (bool) – Whether to merge edges that share identical paths (essentially taking advantage of shared recombination breakpoints).

  • num_threads (int) – The number of match worker threads to use. If this is <= 0 then a simpler sequential algorithm is used (default).

Returns:

The ancestors tree sequence representing the inferred history of the set of ancestors.

Return type:

tskit.TreeSequence

tsinfer.match_samples(variant_data, ancestors_ts, *, recombination_rate=None, mismatch_ratio=None, path_compression=True, post_process=None, indexes=None, force_sample_times=False, num_threads=0)[source]#

Runs the sample matching algorithm on the specified VariantData instance and ancestors tree sequence, returning the final tskit.TreeSequence instance containing the full inferred history for all samples and sites. See matching ancestors & samples in the documentation for details of recombination_rate, mismatch_ratio and path_compression.

Parameters:
  • variant_data (VariantData) – The VariantData instance representing the input data.

  • ancestors_ts (tskit.TreeSequence) – The tskit.TreeSequence instance representing the inferred history among ancestral haplotypes.

  • recombination_rate (float, msprime.RateMap) – Either a floating point value giving a constant rate \(\rho\) per unit length of genome, or an msprime.RateMap object. This is used to calculate the probability of recombination between adjacent sites. If None, all matching conflicts are resolved by recombination and all inference sites will have a single mutation (equivalent to mismatch_ratio near zero)

  • mismatch_ratio (float) – The probability of a mismatch relative to the median probability of recombination between adjacent sites: can only be used if a recombination rate has been set (default: None treated as 1 if recombination_rate is set).

  • path_compression (bool) – Whether to merge edges that share identical paths (essentially taking advantage of shared recombination breakpoints).

  • indexes (array_like) – An array of indexes into the variant_data file of the samples to match (in increasing order) or None for all samples.

  • post_process (bool) – Whether to run the post_process() method on the tree sequence which, among other things, removes ancestral material that does not end up in the current samples (if not specified, defaults to True).

  • force_sample_times (bool) – After matching, should an attempt be made to adjust the time of “historical samples” (those associated with an individual having a non-zero time) such that the sample nodes in the tree sequence appear at the time of the individual with which they are associated.

  • num_threads (int) – The number of match worker threads to use. If this is <= 0 then a simpler sequential algorithm is used (default).

  • simplify (bool) – Treated as an alias for post_process, deprecated but currently retained for backwards compatibility if set to False.

  • overlay_non_inference_sites (bool) – If True, sites that were included in the selected sites, but were not used for inference, will be added to the tree sequence by mapping their mutations over the inferred topology. Defaults to True.

Returns:

The tree sequence representing the inferred history of the sample.

Return type:

tskit.TreeSequence
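Putting the stages together, a sketch of the separate-stage pipeline mentioned under infer(), which allows different mismatch ratios at each matching stage (paths, array name and rates are illustrative):

import tsinfer

vdata = tsinfer.VariantData("mydata.vcz", ancestral_state="variant_ancestral_allele")
ancestors = tsinfer.generate_ancestors(vdata, num_threads=4)
anc_ts = tsinfer.match_ancestors(
    vdata, ancestors, recombination_rate=1e-8, mismatch_ratio=0.1)
ts = tsinfer.match_samples(
    vdata, anc_ts, recombination_rate=1e-8, mismatch_ratio=1)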

tsinfer.augment_ancestors(variant_data, ancestors_ts, indexes, *, recombination_rate=None, mismatch_ratio=None, path_compression=True, num_threads=0)[source]#

Runs the sample matching algorithm on the specified VariantData instance and ancestors tree sequence, for the specified subset of sample indexes, returning the tskit.TreeSequence instance including these samples. This tree sequence can then be used as an ancestors tree sequence for subsequent matching against all samples. See matching ancestors & samples in the documentation for details of recombination_rate, mismatch_ratio and path_compression.

Parameters:
  • variant_data (VariantData) – The VariantData instance representing the input data.

  • ancestors_ts (tskit.TreeSequence) – The tskit.TreeSequence instance representing the inferred history among ancestral haplotypes.

  • indexes (array) – The sample indexes to insert into the ancestors tree sequence, in increasing order.

  • recombination_rate (float, msprime.RateMap) – Either a floating point value giving a constant rate \(\rho\) per unit length of genome, or an msprime.RateMap object. This is used to calculate the probability of recombination between adjacent sites. If None, all matching conflicts are resolved by recombination and all inference sites will have a single mutation (equivalent to mismatch_ratio near zero)

  • mismatch_ratio (float) – The probability of a mismatch relative to the median probability of recombination between adjacent sites: can only be used if a recombination rate has been set (default: None treated as 1 if recombination_rate is set).

  • path_compression (bool) – Whether to merge edges that share identical paths (essentially taking advantage of shared recombination breakpoints).

  • num_threads (int) – The number of match worker threads to use. If this is <= 0 then a simpler sequential algorithm is used (default).

Returns:

The specified ancestors tree sequence augmented with copying paths for the specified samples.

Return type:

tskit.TreeSequence
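A sketch of augmentation followed by a final full match, assuming existing vdata and anc_ts instances (the choice of every tenth sample is purely illustrative):

import numpy as np

subset = np.arange(0, vdata.num_samples, 10)
augmented_ts = tsinfer.augment_ancestors(vdata, anc_ts, subset)
ts = tsinfer.match_samples(vdata, augmented_ts)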

tsinfer.post_process(ts, *, split_ultimate=None, erase_flanks=None)[source]#

Post-process a tsinferred tree sequence into a more conventional form. This is the function run by default on the final tree sequence output by match_samples(). It involves the following 4 steps:

  1. If the oldest node is connected to a single child via an edge that spans the entire tree sequence, this oldest node is removed, so that its child becomes the new root (this step is undertaken to remove the “virtual-root-like node” which is added to ancestor tree sequences to enable matching).

  2. If the oldest node is removed in the first step and the new root spans the entire genome, it is treated as the “ultimate ancestor” and (unless split_ultimate is False) the node is split into multiple coexisting nodes, with the splits occurring whenever the children of the ultimate ancestor change. The rationale is that tsinfer creates a single ancestral haplotype with all inference sites in the ancestral state: this is, however, unlikely to represent a single ancestor in the past. If the tree sequence is then dated, the fact that the ultimate ancestor is split into separate nodes allows these nodes to be dated to different times.

  3. Often, extensive regions of the genome exist before the first defined site and after the last defined site. Since no data exists in these regions, post-processing by default erases the inferred topology there. However, if erase_flanks is False, the flanking regions at the start and end will be assigned the same topology as inferred at the first and last site respectively.

  4. The sample nodes are reordered such that they are the first nodes listed in the node table, removing tree nodes and edges that are not on a path between the root and any of the samples (by applying the simplify() method with keep_unary set to True but filter_sites, filter_populations and filter_individuals set to False).

Parameters:
  • ts (tskit.TreeSequence) – The tree sequence to be post-processed.

  • split_ultimate (bool) – If True (default) and the oldest node is the only parent to a single “ultimate ancestor” node, attempt to split this node into many separate nodes (see above). If False, do not attempt to identify or split an ultimate ancestor node.

  • erase_flanks (bool) – If True (default), keep only the inferred topology between the first and last sites. If False, output regions of topology inferred before the first site and after the last site.

Returns:

The post-processed tree sequence.

Return type:

tskit.TreeSequence
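For example, to post-process manually while keeping the flanking topology (a sketch assuming existing vdata and anc_ts instances):

raw_ts = tsinfer.match_samples(vdata, anc_ts, post_process=False)
ts = tsinfer.post_process(raw_ts, erase_flanks=False)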

Batched inference#

tsinfer.match_ancestors_batch_init(work_dir, variant_data_path, ancestral_state, ancestor_data_path, min_work_per_job, *, max_num_partitions=None, sample_mask=None, site_mask=None, sequence_length=None, recombination_rate=None, mismatch_ratio=None, path_compression=True, recombination=None, mismatch=None, precision=None, engine='C', extended_checks=False, time_units=None, record_provenance=True)[source]#

Initialise a batched ancestor matching job. This function is used to prepare a working directory for running a batched ancestor matching job. The job is split into groups of ancestors, with each group further split into partitions of ancestors if necessary. work_dir is created and details are written to metadata.json in work_dir. The job can then be run using match_ancestors_batch_groups() and match_ancestors_batch_group_partition() (finalising each partitioned group with match_ancestors_batch_group_finalise()), and finally match_ancestors_batch_finalise(). See large scale inference for more details about how these methods work together. See match_ancestors() for details on ancestor matching.

Parameters:
  • work_dir (str) – The directory in which to store the working files.

  • variant_data_path (str) – The input dataset in VCF Zarr format. Path to the Zarr dataset saved on disk. See VariantData.

  • ancestral_state (Union(array, str)) – A numpy array of strings specifying the ancestral states (alleles) used in inference. This must be the same length as the number of unmasked sites in the dataset. Alternatively, a single string can be provided, giving the name of an array in the input dataset which contains the ancestral states. Unknown ancestral states can be specified using “N”. Any ancestral states which do not match any of the known alleles at that site will be tallied, and a warning issued summarizing the unknown ancestral states.

  • ancestor_data_path (str) – The path to the file containing the ancestors generated by generate_ancestors().

  • min_work_per_job (int) – The minimum amount of work (as a count of genotypes) to allocate to a single parallel job. If the amount of work in a group of ancestors exceeds this level it will be broken up into parallel partitions, subject to the constraint of max_num_partitions.

  • max_num_partitions (int) – The maximum number of partitions to split a group of ancestors into. Useful for limiting the number of jobs in a workflow to avoid job overhead. Defaults to 1000.

  • sample_mask (Union(array, str)) – A numpy array of booleans specifying which samples to mask out (exclude) from the dataset. Alternatively, a string can be provided, giving the name of an array in the input dataset which contains the sample mask. If None (default), all samples are included.

  • site_mask (Union(array, str)) – A numpy array of booleans specifying which sites to mask out (exclude) from the dataset. Alternatively, a string can be provided, giving the name of an array in the input dataset which contains the site mask. If None (default), all sites are included.

  • sequence_length (int) – An integer specifying the resulting sequence_length attribute of the output tree sequence. If not specified, the contig_length attribute from the underlying Zarr store for the contig of the selected variants is used. If that is not present, the maximum position plus one of the used variants is used.

  • recombination_rate (float, msprime.RateMap) – Either a floating point value giving a constant rate \(\rho\) per unit length of genome, or an msprime.RateMap object. This is used to calculate the probability of recombination between adjacent sites. If None, all matching conflicts are resolved by recombination and all inference sites will have a single mutation (equivalent to mismatch_ratio near zero)

  • mismatch_ratio (float) – The probability of a mismatch relative to the median probability of recombination between adjacent sites: can only be used if a recombination rate has been set (default: None treated as 1 if recombination_rate is set).

  • path_compression (bool) – Whether to merge edges that share identical paths (essentially taking advantage of shared recombination breakpoints).

Returns:

A dictionary of the job metadata, as written to metadata.json in work_dir. ancestor_grouping in this dict contains the grouping of ancestors into groups and should be used to guide calling match_ancestors_batch_groups() and match_ancestors_batch_group_partition().

Return type:

dict

tsinfer.match_ancestors_batch_groups(work_dir, group_index_start, group_index_end, num_threads=0)[source]#

Match a set of ancestor groups from group_index_start (inclusive) to group_index_end (exclusive) in a batched ancestor matching job. See large scale inference for more details.

A tree sequence file for group group_index_start - 1 must exist in work_dir, unless group_index_start is 0. After matching, the tree sequence for group group_index_end - 1 is written to work_dir.

Parameters:
  • work_dir (str) – The working directory for the batch job, as written by match_ancestors_batch_init().

  • group_index_start (int) – The first group index to match.

  • group_index_end (int) – The group index to stop matching at.

  • num_threads (int) – The number of worker threads to use. If this is <= 1 then match sequentially.

Returns:

The tree sequence representing the inferred ancestors for the last group matched

Return type:

tskit.TreeSequence

tsinfer.match_ancestors_batch_group_partition(work_dir, group_index, partition_index)[source]#

Match a single partition of ancestors from a group in a batched ancestor matching job. See large scale inference for more details. The tree sequence for the group before must exist in work_dir. After matching the results for the partition are written to work_dir. Once all partitions for a group have been matched, the group can be finalised using match_ancestors_batch_group_finalise(). The number of partitions in a group is recorded in metadata.json in the work dir under the ancestor_grouping key. This method uses a single thread.

Parameters:
  • work_dir (str) – The working directory for the batch job, as written by match_ancestors_batch_init().

  • group_index (int) – The group index that contains the partition to match.

  • partition_index (int) – The partition index to match. Must be less than the number of partitions in the batch job metadata for this group.

tsinfer.match_ancestors_batch_group_finalise(work_dir, group_index)[source]#

Finalise a group of partitioned ancestors in a batched ancestor matching job. See large scale inference for more details. The tree sequence for the group before must exist in work_dir, along with the results for all partitions in this group. Writes the tree sequence for the group to work_dir.

Parameters:
  • work_dir (str) – The working directory for the batch job, as written by match_ancestors_batch_init().

  • group_index (int) – The index of the group to finalise.

Returns:

The tree sequence representing the inferred ancestors for the group

Return type:

tskit.TreeSequence

tsinfer.match_ancestors_batch_finalise(work_dir)[source]#

Finalise a batched ancestor matching job. This method should be called after all groups have been matched, either by match_ancestors_batch_groups() or match_ancestors_batch_group_finalise(). Returns the final ancestors tree sequence for the batch job. work_dir is retained and not deleted.

Parameters:

work_dir (str) – The working directory for the batch job, as written by match_ancestors_batch_init().

Returns:

The tree sequence representing the inferred ancestors for the batch job

Return type:

tskit.TreeSequence
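A single-process sketch of the whole batched ancestor matching workflow. The paths and min_work_per_job value are illustrative, and the assumed structure of the ancestor_grouping entries (a partitions field that is None for unpartitioned groups) should be checked against the metadata.json written by your tsinfer version; in practice each call below would typically be dispatched as a separate job in a workflow manager:

import tsinfer

metadata = tsinfer.match_ancestors_batch_init(
    work_dir="ancestor_batch",
    variant_data_path="mydata.vcz",
    ancestral_state="variant_ancestral_allele",
    ancestor_data_path="mydata.ancestors",
    min_work_per_job=100_000_000,
)
for group_index, group in enumerate(metadata["ancestor_grouping"]):
    if group["partitions"] is None:
        # Unpartitioned group: match it directly.
        tsinfer.match_ancestors_batch_groups(
            "ancestor_batch", group_index, group_index + 1)
    else:
        # Partitioned group: match each partition (parallelisable),
        # then finalise the group.
        for partition_index in range(len(group["partitions"])):
            tsinfer.match_ancestors_batch_group_partition(
                "ancestor_batch", group_index, partition_index)
        tsinfer.match_ancestors_batch_group_finalise(
            "ancestor_batch", group_index)
anc_ts = tsinfer.match_ancestors_batch_finalise("ancestor_batch")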

tsinfer.match_samples_batch_init(work_dir, variant_data_path, ancestral_state, ancestor_ts_path, min_work_per_job, *, sample_mask=None, site_mask=None, sites_time=None, individuals_time=None, individuals_location=None, individuals_population=None, individuals_flags=None, sequence_length=None, recombination_rate=None, mismatch_ratio=None, path_compression=True, indexes=None, post_process=None, force_sample_times=False, overlay_non_inference_sites=None, recombination=None, mismatch=None, precision=None, extended_checks=False, engine='C', record_provenance=True)[source]#

Initialise a batched sample matching job. Creates work_dir and writes job details to metadata.json. The job can then be run using parallel calls to match_samples_batch_partition() and once those are complete finally match_samples_batch_finalise().

The num_partitions key in the metadata dict contains the number of partitions that need to be processed.

Parameters:
  • work_dir (str) – The directory in which to store the working files.

  • variant_data_path (str) – The input dataset in VCF Zarr format. Path to the Zarr dataset saved on disk. See VariantData.

  • ancestral_state (Union(array, str)) – A numpy array of strings specifying the ancestral states (alleles) used in inference. This must be the same length as the number of unmasked sites in the dataset. Alternatively, a single string can be provided, giving the name of an array in the input dataset which contains the ancestral states. Unknown ancestral states can be specified using “N”. Any ancestral states which do not match any of the known alleles at that site will be tallied, and a warning issued summarizing the unknown ancestral states.

  • ancestor_ts_path (str) – The path to the tree sequence file containing the ancestors generated by match_ancestors_batch_finalise(), or match_ancestors().

  • min_work_per_job (int) – The minimum amount of work (as a count of genotypes) to allocate to a single parallel job. If the amount of work in a group of samples exceeds this level it will be broken up into parallel partitions, subject to the constraint of max_num_partitions.

  • sample_mask (Union(array, str)) – A numpy array of booleans specifying which samples to mask out (exclude) from the dataset. Alternatively, a string can be provided, giving the name of an array in the input dataset which contains the sample mask. If None (default), all samples are included.

  • site_mask (Union(array, str)) – A numpy array of booleans specifying which sites to mask out (exclude) from the dataset. Alternatively, a string can be provided, giving the name of an array in the input dataset which contains the site mask. If None (default), all sites are included.

  • sites_time (Union(array, str)) – A numpy array of floats specifying the relative time of occurrence of the mutation to the derived state at each site. This must be of the same length as the number of unmasked sites. Alternatively, a string can be provided, giving the name of an array in the input dataset which contains the site times. If None (default), the frequency of the derived allele is used as a proxy for the time of occurrence: this is usually a reasonable approximation to the relative order of ancestors used for inference. Time values are ignored for sites not used in inference, such as singletons, sites with more than two alleles, or sites with an unknown ancestral state.

  • individuals_time (Union(array, str)) – A numpy array of floats specifying the time of each individual in the dataset. This must be the same length as the number of unmasked individuals. Alternatively, a string can be provided, giving the name of an array in the input dataset which contains the individual times. If None (default), individuals are assumed to have tskit.UNKNOWN_TIME.

  • individuals_location (Union(array, str)) – A numpy array specifying the location of each individual in the dataset. This must be the same length as the number of unmasked individuals. Alternatively, a string can be provided, giving the name of an array in the input dataset which contains the individual locations. If None (default), individuals are assumed to have empty location arrays.

  • individuals_population (Union(array, str)) – A numpy array of integers specifying the population of each individual in the dataset. This must be the same length as the number of unmasked individuals. Alternatively, a string can be provided, giving the name of an array in the input dataset which contains the individual populations. If None (default), individuals are assumed to have tskit.NULL as their population.

  • individuals_flags (Union(array, str)) – A numpy array of integers specifying the flags of each individual in the dataset. This must be the same length as the number of unmasked individuals. Alternatively, a string can be provided, giving the name of an array in the input dataset which contains the individual flags. If None (default), individuals are assumed to have flags set to 0.

  • sequence_length (int) – An integer specifying the resulting sequence_length attribute of the output tree sequence. If not specified, the contig_length attribute from the underlying Zarr store for the contig of the selected variants is used. If that is not present, the maximum position plus one of the used variants is used.

  • recombination_rate (float, msprime.RateMap) – Either a floating point value giving a constant rate \(\rho\) per unit length of genome, or an msprime.RateMap object. This is used to calculate the probability of recombination between adjacent sites. If None, all matching conflicts are resolved by recombination and all inference sites will have a single mutation (equivalent to mismatch_ratio near zero)

  • mismatch_ratio (float) – The probability of a mismatch relative to the median probability of recombination between adjacent sites: can only be used if a recombination rate has been set (default: None treated as 1 if recombination_rate is set).

  • path_compression (bool) – Whether to merge edges that share identical paths (essentially taking advantage of shared recombination breakpoints).

  • indexes (arraylike) – The sample indexes to match. If None (default), all samples are matched.

  • post_process (bool) – Whether to run the post_process() method on the tree sequence which, among other things, removes ancestral material that does not end up in the current samples (if not specified, defaults to True).

  • force_sample_times (bool) – After matching, should an attempt be made to adjust the time of “historical samples” (those associated with an individual having a non-zero time) such that the sample nodes in the tree sequence appear at the time of the individual with which they are associated.

  • overlay_non_inference_sites (bool) – If True, sites that were included in the selected sites, but were not used for inference, will be added to the tree sequence by mapping their mutations over the inferred topology. Defaults to True.

Returns:

A dictionary of the job metadata, as written to metadata.json in work_dir.

Return type:

dict

tsinfer.match_samples_batch_partition(work_dir, partition_index)[source]#

Match a single partition of samples in a batched sample matching job. See large scale inference for more details. Match data for the partition is written to work_dir. Uses a single thread to perform matching.

Parameters:
  • work_dir (str) – The working directory for the batch job, as written by match_samples_batch_init().

  • partition_index (int) – The partition index to match. Must be less than the number of partitions in the batch job metadata key num_partitions.

tsinfer.match_samples_batch_finalise(work_dir)[source]#

Finalise a batched sample matching job. This method should be called after all partitions have been matched by match_samples_batch_partition(). Returns the final tree sequence for the batch job. work_dir is retained and not deleted.

Parameters:

work_dir (str) – The working directory for the batch job, as written by match_samples_batch_init().

Returns:

The tree sequence representing the inferred history of the samples.

Return type:

tskit.TreeSequence
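A single-process sketch of the batched sample matching workflow (paths and min_work_per_job are illustrative; the partition calls are independent and would normally run as parallel jobs):

import tsinfer

metadata = tsinfer.match_samples_batch_init(
    work_dir="sample_batch",
    variant_data_path="mydata.vcz",
    ancestral_state="variant_ancestral_allele",
    ancestor_ts_path="ancestors.trees",
    min_work_per_job=100_000_000,
)
for partition_index in range(metadata["num_partitions"]):
    tsinfer.match_samples_batch_partition("sample_batch", partition_index)
ts = tsinfer.match_samples_batch_finalise("sample_batch")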

Container classes#

class tsinfer.Variant(site, genotypes, alleles)[source]#

A single variant. Mirrors the definition in tskit.

class tsinfer.Site(id, position, ancestral_allele, metadata, time, alleles)[source]#

A single site. Mirrors the definition in tskit with some additional fields.

Exceptions#

exception tsinfer.FileFormatError[source]#

Exception raised when a malformed file is encountered.