API Documentation

API Documentation#

File formats#

tsinfer.load(path)[source]#

Loads a tsinfer SampleData or AncestorData file from the specified path. The correct class will be determined by the content of the file. If the file is format not recognised a FileFormatError will be thrown.

Parameters:

path (str) – The path of the file we wish to load.

Returns:

The corresponding SampleData or AncestorData instance opened in read only mode.

Return type:

AncestorData or SampleData.

Raises:

FileFormatError if the file cannot be read.

Sample data#

class tsinfer.SampleData(sequence_length=0, *, path=None, num_flush_threads=0, compressor=DEFAULT_COMPRESSOR, chunk_size=1024, max_file_size=None)[source]#

Class representing input sample data used for inference. See sample data file format specifications for details on the structure of this file.

The most common usage for this class will be to import data from some external source and save it to file for later use. This will usually follow a pattern like:

sample_data = tsinfer.SampleData(path="mydata.samples")
sample_data.add_site(position=1234, genotypes=[0, 0, 1, 0], alleles=["G", "C"])
sample_data.add_site(position=5678, genotypes=[1, 1, 1, 0], alleles=["A", "T"])
sample_data.finalise()

This creates a sample data file for four haploid samples and two sites, and saves it in the file “mydata.samples”. Note that the call to finalise() is essential here to ensure that all data will be correctly flushed to disk. For convenience, a context manager may also be used to ensure this is done:

with tsinfer.SampleData(path="mydata.samples") as sample_data:
    sample_data.add_site(1234, [0, 0, 1, 0], ["G", "C"])
    sample_data.add_site(5678, [1, 1, 1, 0], ["A", "T"])

More complex data models consisting of populations and polyploid individuals can also be specified. For example, we might have:

with tsinfer.SampleData(path="mydata.samples") as sample_data:
    # Define populations
    sample_data.add_population(metadata={"name": "CEU"})
    sample_data.add_population(metadata={"name": "YRI"})
    # Define individuals
    sample_data.add_individual(ploidy=2, population=0, metadata={"name": "NA12"})
    sample_data.add_individual(ploidy=2, population=0, metadata={"name": "NA13"})
    sample_data.add_individual(ploidy=2, population=0, metadata={"name": "NA14"})
    sample_data.add_individual(ploidy=2, population=1, metadata={"name": "NA15"})
    # Define sites and genotypes
    sample_data.add_site(1234, [0, 1, 1, 1, 0, 0, 0, 0], ["G", "C"])
    sample_data.add_site(5678, [0, 0, 0, 0, 0, 0, 1, 1], ["A", "T"])

In this example we defined two populations and four diploid individuals, and so our genotypes arrays are of length eight. Thus, at first site the first individual is heterozygous, the second is homozygous with the derived allele and the other two individuals are homozygous with the ancestral allele. To illustrate how we can use site and population metadata to link up with external data sources we use the 1000 genomes identifiers (although of course the genotype data is fake). Here we suppose that we have the famous NA12878 trio from the CEU population, and one other individual from the YRI population. This metadata is then embedded in the final tree sequence that we infer, allowing us to use it conveniently in downstream analyses.

Note

If a path is specified, the max_file_size option puts an upper limit on the possible size of the created file. On non-Windows systems, space for this file is not immediately allocated but just “reserved” using sparse file systems. However, on Windows systems the file is allocated immediately, so max_file_size takes a smaller default value, to avoid allocating very large files for no reason. Users who wish to run large inferences on Windows may therefore need to explictly set an appropriate max_file_size. Note that the max_file_size is only used while the file is being built: one the file has been finalised, it is shrunk to its minimum size.

Parameters:
  • sequence_length (float) – If specified, this is the sequence length that will be associated with the tree sequence output by tsinfer.infer() and tsinfer.match_samples(). If provided site coordinates must be less than this value.

  • path (str) – The path of the file to store the sample data. If None, the information is stored in memory and not persistent.

  • num_flush_threads (int) – The number of background threads to use for compressing data and flushing to disc. If <= 0, do not spawn any threads but use a synchronous algorithm instead. Default=0.

  • compressor (numcodecs.abc.Codec) – A numcodecs.abc.Codec instance to use for compressing data. Any codec may be used, but problems may occur with very large datasets on certain codecs as they cannot compress buffers >2GB. If None, do not use any compression. Default=:class:numcodecs.zstd.Zstd.

  • chunk_size (int) – The chunk size used for zarr arrays. This affects compression level and algorithm performance. Default=1024.

  • max_file_size (int) – If a file is being used to store this data, set a maximum size in bytes for the stored file. If None, the default value of 1GiB (2**30 bytes) is used on Windows and 1TiB (2**40 bytes) on other platforms (see above for details).

property sites_genotypes#

The “raw” genotypes array for each site, as passed in when adding sites. The values in this array correspond to indexes into the sites_alleles array.

property sites_alleles#

The alleles list for each site, in the order given when adding sites. If missing data is present, the last allelic state will be None.

property sites_ancestral_allele#

The index into each sites_alleles list which corresponds to the ancestral state. If the ancestral state is unknown, this is indicated by a value of tskit.MISSING_DATA (-1).

data_equal(other)[source]#

Returns True if all the data attributes of this input file and the specified input file are equal. This compares every attribute except the provenance.

To compare two SampleData instances for exact equality of all data including provenance data, use s1 == s2.

Parameters:

other (SampleData) – The other SampleData instance to compare with.

Returns:

True if the data held in this SampleData instance is identical to the date held in the other instance.

Return type:

bool

assert_data_equal(other)[source]#

The same as data_equal(), but raises an assertion rather than returning False. This is useful for testing.

subset(individuals=None, sites=None, *, sequence_length=None, **kwargs)[source]#

Returns a subset of this sample data file consisting of the specified individuals and sites. It is important to note that these are individual IDs and not sample IDs (corresponding to distinct haplotypes within an individual). When working with haploid data, the individual and sample IDs are guaranteed to be the same, and so can be used interchangably.

Parameters:
  • individuals (arraylike) – The individual IDs to retain in the returned subset. IDs must be unique, and refer to valid individuals in the current dataset. IDs can be supplied in any order, and the order will be preserved in the returned data file (i.e., individuals[0] will be the first individual in the new dataset, etc).

  • sites (arraylike) – The site IDs to retain in the returned subset. IDs must be unique, and refer to valid sites in the current dataset. Site IDs can be supplied in any order, but the order will not be preserved in the returned data file, so that sites are always in position sorted order in the output.

  • sequence_length (float) – The sequence length to use for the returned object. If None, use the same sequence length as in the original sample data file.

  • **kwargs – Further arguments passed to the SampleData constructor.

Returns:

A SampleData object.

Return type:

SampleData

min_site_times(individuals_only=False)[source]#

Returns a numpy array of the lower bound of the time of sites in the SampleData file. Each individual with a nonzero time (from the individuals_time array) gives a lower bound on the age of sites where the individual carries a derived allele.

Returns:

A numpy array of the lower bound for each sites time.

Return type:

numpy.ndarray(dtype=float64)

classmethod from_tree_sequence(ts, use_sites_time=None, use_individuals_time=None, **kwargs)[source]#

Create a SampleData instance from the sample nodes in an existing tree sequence. Each sample node in the tree sequence results in a sample created in the returned object. Populations in the tree sequence will be copied into the returned object. Individuals in the tree sequence that are associated with any sample nodes will also be incorporated: the ploidy of each individual is assumed to be the number of sample nodes which reference that individual; individuals with no sample nodes are omitted. A new haploid individual is created for any sample node which lacks an associated individual in the existing tree sequence. Thus a tree sequence with u sample nodes but no individuals will be translated into a SampleData file with u haploid individuals and u samples.

Metadata associated with individuals, populations, sites, and at the top level of the tree sequence, is also stored in the appropriate places in the returned SampleData instance. Any such metadata must either have a schema defined or be JSON encodable text. See the tskit documentation for more details on metadata schemas.

Parameters:
  • ts (tskit.TreeSequence) – The tskit.TreeSequence from which to generate samples.

  • use_sites_time (bool) – If True, the times of nodes in the tree sequence are used to set a time for each site (which affects the relative temporal order of ancestors during inference). Times for a site are only used if there is a single mutation at that site, in which case the node immediately below the mutation is taken as the origination time for the variant. If False, the frequency of the variant is used as a proxy for the relative variant time (see add_site()). Defaults to False.

  • use_individuals_time (bool) – If True, use the time of the sample nodes in the tree sequence as the time of the individuals associated with those nodes in the sample data file. This is likely only to be meaningful if use_sites_time is also True. If False, all individuals are set to time 0. Defaults to False.

  • **kwargs – Further arguments passed to the SampleData constructor.

Returns:

A SampleData object.

Return type:

SampleData

add_population(metadata=None)[source]#

Adds a new Population to this SampleData and returns its ID.

All calls to this method must be made before individuals or sites are defined.

Parameters:

metadata (dict) – A JSON encodable dict-like object containing metadata that is to be associated with this population.

Returns:

The ID of the newly added population.

Return type:

int

add_individual(ploidy=1, metadata=None, population=None, location=None, time=0, flags=None)[source]#

Adds a new Individual to this SampleData and returns its ID and those of the resulting additional samples. Adding an individual with ploidy k results in k new samples being added, and each of these samples will be associated with the new individual. Each new sample will also be associated with the specified population ID. It is an error to specify a population ID that does not correspond to a population defined using add_population().

All calls to this method must be made after populations are defined using add_population() and before sites are defined using add_site().

Parameters:
  • ploidy (int) – The ploidy of this individual. This corresponds to the number of samples added that refer to this individual. Defaults to 1 (haploid).

  • metadata (dict) – A JSON encodable dict-like object containing metadata that is to be associated with this individual.

  • population (int) – The ID of the population to associate with this individual (or more precisely, with the samples for this individual). If not specified or None, defaults to the null population (-1).

  • location (arraylike) – An array-like object defining n-dimensional spatial location of this individual. If not specified or None, the empty location is stored.

  • time (float) – The historical time into the past when the samples associated with this individual were taken. By default we assume that all samples come from the present time (i.e. the default time is 0).

  • flags (int) – The bitwise flags for this individual.

Returns:

The ID of the newly added individual and a list of the sample IDs also added.

Return type:

tuple(int, list(int))

add_site(position, genotypes, alleles=None, metadata=None, inference=None, time=None, ancestral_allele=None)[source]#

Adds a new site to this SampleData and returns its ID.

At a minimum, the new site must specify the position and genotypes. Sites must be added in increasing order of position; duplicate positions are not supported. For each site a list of alleles may be supplied. This list defines the ancestral and derived states at the site. For example, if we set alleles=["A", "T"] then the ancestral state is “A” and the derived state is “T”. The observed state for each sample is then encoded using the genotypes parameter. Thus if we have n samples then this must be a one dimensional array-like object with length n. The genotypes index into the list of alleles, so that for a given array g and sample index j, g[j] should contain 0 if sample j carries the ancestral state at this site and 1 if it carries the derived state. For multiple derived states, there may be more than 2 alleles`, and ``g[j] can be greater than 1, but such sites are not used for inference. All sites must have genotypes for the same number of samples.

All populations and individuals must be defined before this method is called. If no individuals have been defined using add_individual(), the first call to this method adds n haploid individuals, where n is the length of the genotypes array.

Parameters:
  • position (float) – The floating point position of this site. Must be less than the sequence_length if provided to the SampleData constructor. Must be greater than all previously added sites.

  • genotypes (arraylike) – An array-like object defining the sample genotypes at this site. The array of genotypes corresponds to the observed alleles for each sample, represented by indexes into the alleles array. Missing sample data can be represented by tskit.MISSING_DATA in this array. The input is converted to a numpy array with dtype np.int8; therefore, for maximum efficiency ensure that the input array is also of this type.

  • alleles (list(str)) – A list of strings defining the alleles at this site. Only biallelic sites can currently be used for inference. Sites with 3 or more non-missing alleles cannot have inference (below) set to True. If missing data is present in the genotypes array, the stored list of alleles will be modified as necessary so that alleles[tskit.MISSING_DATA] == None. If alleles is not specified or None, a default of [“0”, “1”] is used.

  • metadata (dict) – A JSON encodable dict-like object containing metadata that is to be associated with this site.

  • time (float) – The time of occurence (pastwards) of the mutation to the derived state at this site. If not specified or None, the frequency of the derived alleles (i.e., the proportion of non-zero values in the genotypes, out of all the non-missing values) will be used in inference. For biallelic sites this frequency should provide a reasonable estimate of the relative time, as used to order ancestral haplotypes during the inference process. For sites not used in inference, such as singletons or sites with more than two alleles or when the time is specified as np.nan, then the value is unused. Defaults to None.

  • ancestral_allele (int) – A positive index into the alleles array, specifying which allele is the ancestral state, or tskit.MISSING_DATA (-1) if the ancestral state is unknown (in which case the site will not be used for inference, and the ancestral state will be inferred using parsimony). Default: None, treated as 0, so that the first allele in the list is taken as the ancestral state.

Returns:

The ID of the newly added site.

Return type:

int

finalise()[source]#

Ensures that the state of the data is flushed and writes the provenance for the current operation. The specified ‘command’ is used to fill the corresponding entry in the provenance dictionary.

merge(other, **kwargs)[source]#

Returns a copy of this SampleData file merged with the specified other SampleData file. Subsequent keyword arguments are passed to the SampleData constructor for the returned merged dataset.

The datasets are merged by following process:

  1. We add the populations from this dataset to the result, followed by the populations from other. Population references from the two datasets are updated accordingly.

  2. We add individual data from this dataset to the result, followed by the individuals from the other dataset.

  3. We merge the variant data from the two datasets by comparing sites by their position. If two sites in the datasets have the same position we combine the genotype data. The alleles from this dataset are updated to include any new alleles in other, and we then combine and update the genotypes accordingly. It is an error if sites with the same position have different ancestral state, time, or metadata values. For sites that exist in one dataset and not the other, we insert the site with tskit.MISSING_DATA present in the genotypes for the dataset that does not contain the site.

  4. We add the provenances for this dataset, followed by the provenances for the other dataset.

Parameters:

other (SampleData) – The other SampleData instance to to merge.

Returns:

A new SampleData instance which contains the merged data from the two datasets.

Return type:

SampleData

sites(ids=None)[source]#

Returns an iterator over the Site objects. A subset of the sites can be returned using the ids parameter. This must be a list of integer site IDs.

num_alleles(sites=None)[source]#

Returns a numpy array of the number of alleles at each site. Missing data is not counted as an allele.

Parameters:

sites (array) – A numpy array of sites for which to return data. If None (default) return all sites.

Returns:

A numpy array of the number of alleles at each site.

Return type:

numpy.ndarray(dtype=uint32)

variants(sites=None, recode_ancestral=None)[source]#

Returns an iterator over the Variant objects. This is equivalent to the tskit.TreeSequence.variants() iterator. If recode_ancestral is True, the .alleles attribute of each variant is guaranteed to return the alleles in an order such that the ancestral state is the first item in the list. In this case, variant.alleles may list the alleles in a different order from the input order as listed in variant.site.alleles, and the values in genotypes array will be recoded so that the ancestral state will have a genotype of 0. If the ancestral state is unknown, the original input order is kept.

If a variant contains missing data, it is guaranteed that the alleles attribute for that variant satisfies alleles[tskit.MISSING_DATA] == None.

Parameters:
  • sites (array) – A numpy array of ascending site ids for which to return data. If None (default) return all sites.

  • recode_ancestral (bool) – If True, recode genotypes at sites where the ancestral state is known such that the ancestral state is coded as 0, as described above. Otherwise return genotypes in the input allele encoding. Default: None treated as False.

Returns:

An iterator over the variants in the sample data file.

Return type:

iter(Variant)

haplotypes(samples=None, sites=None, recode_ancestral=None)[source]#

Returns an iterator over the (sample_id, haplotype) pairs. Each haplotype is an array of indexes, where the i th value is an index into the alleles list for the i th specified site (but see warning below).

Warning

If recode_ancestral=True, the haplotype values may not correspond to indexes into the sites.alleles list. Instead, they will correspond to the variant.alleles list, returned when iterating over variants() using variants(recode_ancestral=True).

Parameters:
  • samples (list) – The sample IDs for which haplotypes are returned. If None, return haplotypes for all sample nodes, otherwise this may be a numpy array (or array-like) object (converted to dtype=np.int32).

  • sites (array) – A numpy array of sites to use, or None for all sites.

  • recode_ancestral (bool) – If True, recode genotypes so that the ancestral state is coded as 0 as described under variants(). Otherwise return genotypes in the input allele encoding. Default: None, treated as False.

Returns:

An iterator over (sample_id, haplotype) pairs.

Return type:

iter(int, numpy.ndarray(dtype=int8))

add_provenance(timestamp, record)#

Adds a new provenance record with the specified timestamp and record. Timestamps should ISO8601 formatted, and record is some JSON encodable object.

arrays()#

Returns a list of all the zarr arrays in this DataContainer.

clear_provenances()#

Clear all provenances in this instance

close()#

Close this DataContainer. Any read or write operations attempted after calling this will fail.

copy(path=None, max_file_size=None)#

Returns a copy of this DataContainer opened in ‘edit’ mode. If path is specified, this must not be equal to the path of the current data container.

property file_size#

Returns the size of the underlying file, or -1 if we do not have a file associated.

property info#

Returns a string containing the zarr info for each array.

provenances()#

Returns an iterator over the (timestamp, record) pairs representing the provenances for this data container.

record_provenance(command=None, **kwargs)#

Records the provenance information for this file using the tskit provenances schema.

Todo

  1. Add documentation for the data attributes in read-mode.

  2. Define copy mode.

  3. Provide example of updating inference_sites

Ancestor data#

class tsinfer.AncestorData(position, sequence_length, *, path=None, num_flush_threads=0, compressor=None, chunk_size=1024, max_file_size=None)[source]#

Class representing the stored ancestor data produced by generate_ancestors(). See the ancestor data file format specifications for details on the structure of this file.

See the documentation for SampleData for a discussion of the max_file_size parameter.

Parameters:
  • position (arraylike) – Integer array of the site positions of the ancestors. All values should be >0 and the array should be monotonically increasing.

  • sequence_length (float) – Total length of the sequence, site positions must be less than this value.

  • path (str) – The path of the file to store the ancestor data. If None, the information is stored in memory and not persistent.

  • num_flush_threads (int) – The number of background threads to use for compressing data and flushing to disc. If <= 0, do not spawn any threads but use a synchronous algorithm instead. Default=0.

  • compressor (numcodecs.abc.Codec) – A numcodecs.abc.Codec instance to use for compressing data. Any codec may be used, but problems may occur with very large datasets on certain codecs as they cannot compress buffers >2GB. If None, do not use any compression. Default=:class:numcodecs.zstd.Zstd.

  • chunk_size (int) –

    The chunk size used for zarr arrays in the sample dimension. This affects compression level and algorithm performance. Default=1024.

  • chunk_size_sites (int) –

    The chunk size used for the genotype zarr arrays in the sites dimension. This affects compression level and algorithm performance. Default=16384.

  • max_file_size (int) – If a file is being used to store this data, set a maximum size in bytes for the stored file. If None, the default value of 1GiB is used on Windows and 1TiB on other platforms (see above for details).

data_equal(other)[source]#

Returns True if all the data attributes of this input file and the specified input file are equal. This compares every attribute.

property sequence_length#

Returns the sequence length.

property num_sites#

The number of inference sites used to generate the ancestors

property sites_position#

The positions of the inference sites used to generate the ancestors

add_provenance(timestamp, record)#

Adds a new provenance record with the specified timestamp and record. Timestamps should ISO8601 formatted, and record is some JSON encodable object.

arrays()#

Returns a list of all the zarr arrays in this DataContainer.

clear_provenances()#

Clear all provenances in this instance

close()#

Close this DataContainer. Any read or write operations attempted after calling this will fail.

copy(path=None, max_file_size=None)#

Returns a copy of this DataContainer opened in ‘edit’ mode. If path is specified, this must not be equal to the path of the current data container.

property file_size#

Returns the size of the underlying file, or -1 if we do not have a file associated.

property info#

Returns a string containing the zarr info for each array.

provenances()#

Returns an iterator over the (timestamp, record) pairs representing the provenances for this data container.

record_provenance(command=None, **kwargs)#

Records the provenance information for this file using the tskit provenances schema.

property ancestors_length#

Returns the length of ancestors in physical coordinates.

insert_proxy_samples(sample_data, *, sample_ids=None, epsilon=None, allow_mutation=False, require_same_sample_data=True, **kwargs)[source]#

Take a set of samples from a sample_data instance and create additional “proxy sample ancestors” from them, returning a new AncestorData instance including both the current ancestors and the additional ancestors at the appropriate time points.

A proxy sample ancestor is an ancestor based upon a known sample. At sites used in the full inference process, the haplotype of this ancestor is identical to that of the sample on which it is based. The time of the ancestor is taken to be a fraction epsilon older than the sample on which it is based.

A common use of this function is to provide ancestral nodes for anchoring historical samples at the correct time when matching them into a tree sequence during the tsinfer.match_samples() stage of inference. For this reason, by default, the samples chosen from sample_data are those associated with historical (i.e. non-contemporary) individuals. This can be altered by using the sample_ids parameter.

Note

The proxy sample ancestors inserted here will correspond to extra nodes in the inferred tree sequence. At sites which are not used in the full inference process (e.g. sites unique to a single historical sample), these proxy sample ancestor nodes may have a different genotype from their corresponding sample.

Parameters:
  • sample_data (SampleData) – The SampleData instance from which to select the samples used to create extra ancestors.

  • sample_ids (list(int)) – A list of sample ids in the sample_data instance that will be selected to create the extra ancestors. If None (default) select all the historical samples, i.e. those associated with an Individual whose time is greater than zero. The order of ids is ignored, as are duplicate ids.

  • epsilon (list(float)) – An list of small time increments determining how much older each proxy sample ancestor is than the corresponding sample listed in sample_ids. A single value is also allowed, in which case it is used as the time increment for all selected proxy sample ancestors. If None (default) find \({\delta}t\), the smallest time difference between between the sample times and the next oldest ancestor in the current AncestorData instance, setting epsilon = \({\delta}t / 100\) (or, if all selected samples are at least as old as the oldest ancestor, take \({\delta}t\) to be the smallest non-zero time difference between existing ancestors).

  • allow_mutation (bool) – If False (the default), any site in a proxy sample ancestor that has a derived allele must have a pre-existing mutation in an older (non-proxy) ancestor, otherwise an error is raised. Alternatively, if allow_mutation is True, proxy ancestors can contain a de-novo mutation at a site that also has a mutation elsewhere (i.e. breaking the infinite sites assumption), allowing them to possess derived alleles at sites where there are no pre-existing mutations in older ancestors.

  • require_same_sample_data (bool) – Deprecated Has no effect.

  • **kwargs – Further arguments passed to the constructor when creating the new AncestorData instance which will be returned.

Returns:

A new AncestorData object.

Return type:

AncestorData

truncate_ancestors(lower_time_bound, upper_time_bound, length_multiplier=2, buffer_length=1000, **kwargs)[source]#

Truncates the length of ancestors above a given time and returns a new AncestorData instance.

Given a set of haplotypes H such that lower_time_bound <= h.time < upper_time_bound, we let max_len = length_multiplier * max(max(h.length) for h in H). Then, we truncate all haplotypes containing at least one focal site where h.time >= upper, ensuring these haplotypes extend no further than half of max_len to the either side of the leftmost and rightmost focal sites of the ancestral haplotype. Note that ancestors above upper_time_bound may still be longer than max_len if the ancestor contains greater than 2 focal sites.

This function should be used when tsinfer.generate_ancestors() generates old ancestors which are very long, as these can significantly slow down matching time. Older ancestors should generally be shorter than younger ancestors, so truncating the lengths of older ancestors has negligible effect on inference accuracy.

Note

Please ensure that the time values provided to lower_time_bound and upper_time_bound match the units used in the AncestorData file, i.e. if your ancestors do not have site times specified, upper_time_bound should be between 0 and 1.

Parameters:
  • lower_time_bound (float) – Defines the lower bound (inclusive) of the half open interval where we search for a truncation value.

  • upper_time_bound (float) – Defines the upper bound (exclusive) of the half open interval where we search for a truncation value. The truncation value is the length of the longest haplotype in this interval multiplied by length_multiplier. The length of ancestors as old or older than upper_time_bound will be truncated using this value.

  • length_multiplier (float) – A multiplier for the length of the longest ancestor in the half-open interval between lower_time_bound (inclusive) and uppper_time_bound (exclusive), i.e. if the longest ancestor in the interval is 1 megabase, a length_multiplier of 2 creates a maximum length of 2 megabases.

  • buffer_length (int) – The number of changed ancestors to buffer before writing to disk.

  • **kwargs – Further arguments passed to the AncestorData.copy() when creating the new AncestorData instance which will be returned.

Returns:

A new AncestorData object.

Return type:

AncestorData

add_ancestor(start, end, time, focal_sites, haplotype)[source]#

Adds an ancestor with the specified haplotype, with ancestral material over the interval [start:end], that is associated with the specified timepoint and has new mutations at the specified list of focal sites. Ancestors should be added in time order, with the oldest first. The id of the added ancestor is returned.

finalise()[source]#

Ensures that the state of the data is flushed and writes the provenance for the current operation. The specified ‘command’ is used to fill the corresponding entry in the provenance dictionary.

ancestor(id_)[source]#

Returns the ancestor with the specified ID.

Return type:

Ancestor

ancestors(indexes=None)[source]#

Returns an iterator over all the ancestors. If indexes is provided, it should be a sorted list of indexes giving a subset of ancestors to return. For efficiency, the indexes should be a numpy integer array.

Todo

  1. Add documentation for the data attributes in read-mode.

Running inference#

tsinfer.infer(sample_data, *, recombination_rate=None, mismatch_ratio=None, path_compression=True, exclude_positions=None, post_process=None, num_threads=0)[source]#

Runs the full inference pipeline on the specified SampleData instance and returns the inferred tskit.TreeSequence. See matching ancestors & samples in the documentation for details of recombination_rate, mismatch_ratio and path_compression.

Note

For finer grained control over inference, for example to set different mismatch ratios when matching ancestors versus samples, run tsinfer.generate_ancestors(), tsinfer.match_ancestors() and tsinfer.match_samples() separately.

Parameters:
  • sample_data (SampleData) – The input SampleData instance representing the observed data that we wish to make inferences from.

  • recombination_rate (float, msprime.RateMap) – Either a floating point value giving a constant rate \(\rho\) per unit length of genome, or an msprime.RateMap object. This is used to calculate the probability of recombination between adjacent sites. If None, all matching conflicts are resolved by recombination and all inference sites will have a single mutation (equivalent to mismatch_ratio near zero)

  • mismatch_ratio (float) – The probability of a mismatch relative to the median probability of recombination between adjacent sites: can only be used if a recombination rate has been set (default: None treated as 1 if recombination_rate is set).

  • path_compression (bool) – Whether to merge edges that share identical paths (essentially taking advantage of shared recombination breakpoints).

  • post_process (bool) – Whether to run the post_process() method on the the tree sequence which, among other things, removes ancestral material that does not end up in the current samples (if not specified, defaults to True)

  • exclude_positions (array_like) – A list of site positions to exclude for full inference. Sites with these positions will not be used to generate ancestors, and not used during the copying process. Any such sites that exist in the sample data file will be included in the trees after the main inference process using parsimony. The list does not need to be in to be in any particular order, and can include site positions that are not present in the sample data file.

  • num_threads (int) – The number of worker threads to use in parallelised sections of the algorithm. If <= 0, do not spawn any threads and use simpler sequential algorithms (default).

  • simplify (bool) – When post_processing, only simplify the tree sequence. deprecated but retained for backwards compatibility (default: None).

Returns:

The tskit.TreeSequence object inferred from the input sample data.

Return type:

tskit.TreeSequence

tsinfer.generate_ancestors(sample_data, *, path=None, exclude_positions=None, num_threads=0, genotype_encoding=None, mmap_temp_dir=None, **kwargs)[source]#

Runs the ancestor generation algorithm on the specified SampleData instance and returns the resulting AncestorData instance. If you wish to store generated ancestors persistently on file you must pass the path keyword argument to this function. For example,

ancestor_data = tsinfer.generate_ancestors(sample_data, path="mydata.ancestors")

Other keyword arguments are passed to the AncestorData constructor, which may be used to control the storage properties of the generated file.

Ancestor generation involves loading the entire genotype matrix into memory, by default using one byte per haploid genotype, which can be prohibitively large when working with sample sizes of 100,000 or more. There are two options to help mitigate memory usage. The genotype_encoding parameter allows the user to specify a more compact encoding scheme, which reduces storage space for datasets with small numbers of alleles. Currently, the GenotypeEncoding.ONE_BIT encoding is supported, which provides 8-fold compression of biallelic, non-missing data. An error is raised if an encoding that does not support the range of values present in a given dataset is provided.

The second option for reducing the RAM footprint of this function is to use the mmap_temp_dir parameter. This allows the genotype data to be cached on file, transparently using the operating system’s virtual memory subsystem to swap in and out the data. This can work well if the encoded genotype matrix almost fits in RAM and fast local storage is available. However, if the size of the encoded genotype matrix is more than, say, twice the available RAM it is unlikely that this function will complete in a reasonable time-frame. A temporary file is created in the specified mmap_temp_dir, which is automatically deleted when the function completes.

Warning

The mmap_temp_dir option is a silent no-op on Windows!

Parameters:
  • sample_data (SampleData) – The SampleData instance that we are genering putative ancestors from.

  • path (str) – The path of the file to store the sample data. If None, the information is stored in memory and not persistent.

  • exclude_positions (array_like) – A list of site positions to exclude for full inference. Sites with these positions will not be used to generate ancestors, and not used during the copying process. The list does not need be in any particular order.

  • num_threads (int) – The number of worker threads to use. If < 1, use a simpler synchronous algorithm.

  • genotype_encoding (int) – The encoding to use for genotype data internally when generating ancestors. See the GenotypeEncoding class for the available options. Defaults to one-byte per genotype.

  • mmap_temp_dir (str) – The directory within which to create the temporary backing file when using mmaped memory for bulk genotype storage. If None (the default) allocate memory directly using the standard mechanism. This is an advanced option, usually only relevant when working with very large datasets (see above for more information).

Returns:

The inferred ancestors stored in an AncestorData instance.

Return type:

AncestorData

class tsinfer.GenotypeEncoding(value)[source]#

The encoding scheme used to store genotypes.

EIGHT_BIT = 0#

The default approach of using one-byte per genotype. Supports up to 127 alleles and missing data.

ONE_BIT = 1#

Encode binary genotype data using a single bit.

tsinfer.match_ancestors(sample_data, ancestor_data, *, recombination_rate=None, mismatch_ratio=None, path_compression=True, num_threads=0)[source]#

Run the ancestor matching algorithm on the specified SampleData and AncestorData instances, returning the resulting tskit.TreeSequence representing the complete ancestry of the putative ancestors. See matching ancestors & samples in the documentation for details of recombination_rate, mismatch_ratio and path_compression.

Parameters:
  • sample_data (SampleData) – The SampleData instance representing the input data.

  • ancestor_data (AncestorData) – The AncestorData instance representing the set of ancestral haplotypes for which we are finding a history.

  • recombination_rate (float, msprime.RateMap) – Either a floating point value giving a constant rate \(\rho\) per unit length of genome, or an msprime.RateMap object. This is used to calculate the probability of recombination between adjacent sites. If None, all matching conflicts are resolved by recombination and all inference sites will have a single mutation (equivalent to mismatch_ratio near zero)

  • mismatch_ratio (float) – The probability of a mismatch relative to the median probability of recombination between adjacent sites: can only be used if a recombination rate has been set (default: None treated as 1 if recombination_rate is set).

  • path_compression (bool) – Whether to merge edges that share identical paths (essentially taking advantage of shared recombination breakpoints).

  • num_threads (int) – The number of match worker threads to use. If this is <= 0 then a simpler sequential algorithm is used (default).

Returns:

The ancestors tree sequence representing the inferred history of the set of ancestors.

Return type:

tskit.TreeSequence

tsinfer.match_samples(sample_data, ancestors_ts, *, recombination_rate=None, mismatch_ratio=None, path_compression=True, post_process=None, indexes=None, force_sample_times=False, num_threads=0)[source]#

Runs the sample matching algorithm on the specified SampleData instance and ancestors tree sequence, returning the final tskit.TreeSequence instance containing the full inferred history for all samples and sites. See matching ancestors & samples in the documentation for details of recombination_rate, mismatch_ratio and path_compression.

Parameters:
  • sample_data (SampleData) – The SampleData instance representing the input data.

  • ancestors_ts (tskit.TreeSequence) – The tskit.TreeSequence instance representing the inferred history among ancestral ancestral haplotypes.

  • recombination_rate (float, msprime.RateMap) – Either a floating point value giving a constant rate \(\rho\) per unit length of genome, or an msprime.RateMap object. This is used to calculate the probability of recombination between adjacent sites. If None, all matching conflicts are resolved by recombination and all inference sites will have a single mutation (equivalent to mismatch_ratio near zero)

  • mismatch_ratio (float) – The probability of a mismatch relative to the median probability of recombination between adjacent sites: can only be used if a recombination rate has been set (default: None treated as 1 if recombination_rate is set).

  • path_compression (bool) – Whether to merge edges that share identical paths (essentially taking advantage of shared recombination breakpoints).

  • indexes (array_like) – An array of indexes into the sample_data file of the samples to match (in increasing order) or None for all samples.

  • post_process (bool) – Whether to run the post_process() method on the the tree sequence which, among other things, removes ancestral material that does not end up in the current samples (if not specified, defaults to True)

  • force_sample_times (bool) – After matching, should an attempt be made to adjust the time of “historical samples” (those associated with an individual having a non-zero time) such that the sample nodes in the tree sequence appear at the time of the individual with which they are associated.

  • num_threads (int) – The number of match worker threads to use. If this is <= 0 then a simpler sequential algorithm is used (default).

  • simplify (bool) – Treated as an alias for post_process, deprecated but currently retained for backwards compatibility if set to False.

Returns:

The tree sequence representing the inferred history of the sample.

Return type:

tskit.TreeSequence

tsinfer.augment_ancestors(sample_data, ancestors_ts, indexes, *, recombination_rate=None, mismatch_ratio=None, path_compression=True, num_threads=0)[source]#

Runs the sample matching algorithm on the specified SampleData instance and ancestors tree sequence, for the specified subset of sample indexes, returning the tskit.TreeSequence instance including these samples. This tree sequence can then be used as an ancestors tree sequence for subsequent matching against all samples. See matching ancestors & samples in the documentation for details of recombination_rate, mismatch_ratio and path_compression.

Parameters:
  • sample_data (SampleData) – The SampleData instance representing the input data.

  • ancestors_ts (tskit.TreeSequence) – The tskit.TreeSequence instance representing the inferred history among ancestral ancestral haplotypes.

  • indexes (array) – The sample indexes to insert into the ancestors tree sequence, in increasing order.

  • recombination_rate (float, msprime.RateMap) – Either a floating point value giving a constant rate \(\rho\) per unit length of genome, or an msprime.RateMap object. This is used to calculate the probability of recombination between adjacent sites. If None, all matching conflicts are resolved by recombination and all inference sites will have a single mutation (equivalent to mismatch_ratio near zero)

  • mismatch_ratio (float) – The probability of a mismatch relative to the median probability of recombination between adjacent sites: can only be used if a recombination rate has been set (default: None treated as 1 if recombination_rate is set).

  • path_compression (bool) – Whether to merge edges that share identical paths (essentially taking advantage of shared recombination breakpoints).

  • num_threads (int) – The number of match worker threads to use. If this is <= 0 then a simpler sequential algorithm is used (default).

Returns:

The specified ancestors tree sequence augmented with copying paths for the specified sample.

Return type:

tskit.TreeSequence

tsinfer.post_process(ts, *, split_ultimate=None, erase_flanks=None)[source]#

Post-process a tsinferred tree sequence into a more conventional form. This is the function run by default on the final tree sequence output by match_samples(). It involves the following 4 steps:

  1. If the oldest node is connected to a single child via an edge that spans the entire tree sequence, this oldest node is removed, so that its child becomes the new root (this step is undertaken to remove the “virtual-root-like node” which is added to ancestor tree sequences to enable matching).

  2. If the oldest node is removed in the first step and the new root spans the entire genome, it is treated as the “ultimate ancestor” and (unless split_ultimate is False) the node is split into multiple coexisiting nodes with the splits occurring whenever the children of the ultimate ancestor change. The rationale is that tsinfer creates a single ancestral haplotype with all inference sites in the ancestral state: this is, however, unlikely to represent a single ancestor in the past. If the tree sequence is then dated, the fact that ultimate ancestor is split into separate nodes allows these nodes to be dated to different times.

  3. Often, extensive regions of genome exist before the first defined site and after the last defined site. Since no data exists in these sections of the genome, post processing by default erases the inferred topology in these regions. However, if erase_flanks is False, the flanking regions at the start and end will be assigned the same topology as inferred at the first and last site respectively.

  4. The sample nodes are reordered such that they are the first nodes listed in the node table, removing tree nodes and edges that are not on a path between the root and any of the samples (by applying the simplify() method with keep_unary set to True but filter_sites, filter_populations and filter_individuals set to False).

Parameters:
  • split_ultimate (bool) – If True (default) and the oldest node is the only parent to a single “ultimate ancestor” node, attempt to split this node into many separate nodes (see above). If False do not attempt to identify or split an ultimate ancestor node.

  • erase_flanks (bool) – If True (default), keep only the inferred topology between the first and last sites. If False, output regions of topology inferred before the first site and after the last site.

Returns:

The post-processed tree sequence.

Return type:

tskit.TreeSequence

Container classes#

class tsinfer.Variant(site, genotypes, alleles)[source]#

A single variant. Mirrors the definition in tskit.

class tsinfer.Site(id, position, ancestral_allele, metadata, time, alleles)[source]#

A single site. Mirrors the definition in tskit with some additional fields.

Exceptions#

exception tsinfer.FileFormatError[source]#

Exception raised when a malformed file is encountered.