API Documentation#
File formats#
- tsinfer.load(path)[source]#
Loads a tsinfer
SampleData
orAncestorData
file from the specified path. The correct class will be determined by the content of the file. If the file is format not recognised aFileFormatError
will be thrown.- Parameters:
path (str) – The path of the file we wish to load.
- Returns:
The corresponding
SampleData
orAncestorData
instance opened in read only mode.- Return type:
- Raises:
FileFormatError
if the file cannot be read.
Sample data#
- class tsinfer.SampleData(sequence_length=0, *, path=None, num_flush_threads=0, compressor=DEFAULT_COMPRESSOR, chunk_size=1024, max_file_size=None)[source]#
Class representing input sample data used for inference. See sample data file format specifications for details on the structure of this file.
The most common usage for this class will be to import data from some external source and save it to file for later use. This will usually follow a pattern like:
sample_data = tsinfer.SampleData(path="mydata.samples") sample_data.add_site(position=1234, genotypes=[0, 0, 1, 0], alleles=["G", "C"]) sample_data.add_site(position=5678, genotypes=[1, 1, 1, 0], alleles=["A", "T"]) sample_data.finalise()
This creates a sample data file for four haploid samples and two sites, and saves it in the file “mydata.samples”. Note that the call to
finalise()
is essential here to ensure that all data will be correctly flushed to disk. For convenience, a context manager may also be used to ensure this is done:with tsinfer.SampleData(path="mydata.samples") as sample_data: sample_data.add_site(1234, [0, 0, 1, 0], ["G", "C"]) sample_data.add_site(5678, [1, 1, 1, 0], ["A", "T"])
More complex data models consisting of populations and polyploid individuals can also be specified. For example, we might have:
with tsinfer.SampleData(path="mydata.samples") as sample_data: # Define populations sample_data.add_population(metadata={"name": "CEU"}) sample_data.add_population(metadata={"name": "YRI"}) # Define individuals sample_data.add_individual(ploidy=2, population=0, metadata={"name": "NA12"}) sample_data.add_individual(ploidy=2, population=0, metadata={"name": "NA13"}) sample_data.add_individual(ploidy=2, population=0, metadata={"name": "NA14"}) sample_data.add_individual(ploidy=2, population=1, metadata={"name": "NA15"}) # Define sites and genotypes sample_data.add_site(1234, [0, 1, 1, 1, 0, 0, 0, 0], ["G", "C"]) sample_data.add_site(5678, [0, 0, 0, 0, 0, 0, 1, 1], ["A", "T"])
In this example we defined two populations and four diploid individuals, and so our genotypes arrays are of length eight. Thus, at first site the first individual is heterozygous, the second is homozygous with the derived allele and the other two individuals are homozygous with the ancestral allele. To illustrate how we can use site and population metadata to link up with external data sources we use the 1000 genomes identifiers (although of course the genotype data is fake). Here we suppose that we have the famous NA12878 trio from the CEU population, and one other individual from the YRI population. This metadata is then embedded in the final tree sequence that we infer, allowing us to use it conveniently in downstream analyses.
Note
If a
path
is specified, themax_file_size
option puts an upper limit on the possible size of the created file. On non-Windows systems, space for this file is not immediately allocated but just “reserved” using sparse file systems. However, on Windows systems the file is allocated immediately, somax_file_size
takes a smaller default value, to avoid allocating very large files for no reason. Users who wish to run large inferences on Windows may therefore need to explictly set an appropriatemax_file_size
. Note that themax_file_size
is only used while the file is being built: one the file has been finalised, it is shrunk to its minimum size.- Parameters:
sequence_length (float) – If specified, this is the sequence length that will be associated with the tree sequence output by
tsinfer.infer()
andtsinfer.match_samples()
. If provided site coordinates must be less than this value.path (str) – The path of the file to store the sample data. If None, the information is stored in memory and not persistent.
num_flush_threads (int) – The number of background threads to use for compressing data and flushing to disc. If <= 0, do not spawn any threads but use a synchronous algorithm instead. Default=0.
compressor (numcodecs.abc.Codec) – A
numcodecs.abc.Codec
instance to use for compressing data. Any codec may be used, but problems may occur with very large datasets on certain codecs as they cannot compress buffers >2GB. If None, do not use any compression. Default=:class:numcodecs.zstd.Zstd.chunk_size (int) – The chunk size used for zarr arrays. This affects compression level and algorithm performance. Default=1024.
max_file_size (int) – If a file is being used to store this data, set a maximum size in bytes for the stored file. If None, the default value of 1GiB (2**30 bytes) is used on Windows and 1TiB (2**40 bytes) on other platforms (see above for details).
- property sites_genotypes#
The “raw” genotypes array for each site, as passed in when adding sites. The values in this array correspond to indexes into the
sites_alleles
array.
- property sites_alleles#
The alleles list for each site, in the order given when adding sites. If missing data is present, the last allelic state will be
None
.
- property sites_ancestral_allele#
The index into each
sites_alleles
list which corresponds to the ancestral state. If the ancestral state is unknown, this is indicated by a value of tskit.MISSING_DATA (-1).
- data_equal(other)[source]#
Returns True if all the data attributes of this input file and the specified input file are equal. This compares every attribute except the provenance.
To compare two
SampleData
instances for exact equality of all data including provenance data, uses1 == s2
.- Parameters:
other (SampleData) – The other
SampleData
instance to compare with.- Returns:
True
if the data held in thisSampleData
instance is identical to the date held in the other instance.- Return type:
- assert_data_equal(other)[source]#
The same as
data_equal()
, but raises an assertion rather than returning False. This is useful for testing.
- subset(individuals=None, sites=None, *, sequence_length=None, **kwargs)[source]#
Returns a subset of this sample data file consisting of the specified individuals and sites. It is important to note that these are individual IDs and not sample IDs (corresponding to distinct haplotypes within an individual). When working with haploid data, the individual and sample IDs are guaranteed to be the same, and so can be used interchangably.
- Parameters:
individuals (arraylike) – The individual IDs to retain in the returned subset. IDs must be unique, and refer to valid individuals in the current dataset. IDs can be supplied in any order, and the order will be preserved in the returned data file (i.e.,
individuals[0]
will be the first individual in the new dataset, etc).sites (arraylike) – The site IDs to retain in the returned subset. IDs must be unique, and refer to valid sites in the current dataset. Site IDs can be supplied in any order, but the order will not be preserved in the returned data file, so that sites are always in position sorted order in the output.
sequence_length (float) – The sequence length to use for the returned object. If None, use the same sequence length as in the original sample data file.
**kwargs – Further arguments passed to the
SampleData
constructor.
- Returns:
A
SampleData
object.- Return type:
- min_site_times(individuals_only=False)[source]#
Returns a numpy array of the lower bound of the time of sites in the SampleData file. Each individual with a nonzero time (from the individuals_time array) gives a lower bound on the age of sites where the individual carries a derived allele.
- Returns:
A numpy array of the lower bound for each sites time.
- Return type:
numpy.ndarray(dtype=float64)
- classmethod from_tree_sequence(ts, use_sites_time=None, use_individuals_time=None, **kwargs)[source]#
Create a SampleData instance from the sample nodes in an existing tree sequence. Each sample node in the tree sequence results in a sample created in the returned object. Populations in the tree sequence will be copied into the returned object. Individuals in the tree sequence that are associated with any sample nodes will also be incorporated: the ploidy of each individual is assumed to be the number of sample nodes which reference that individual; individuals with no sample nodes are omitted. A new haploid individual is created for any sample node which lacks an associated individual in the existing tree sequence. Thus a tree sequence with
u
sample nodes but no individuals will be translated into a SampleData file withu
haploid individuals andu
samples.Metadata associated with individuals, populations, sites, and at the top level of the tree sequence, is also stored in the appropriate places in the returned SampleData instance. Any such metadata must either have a schema defined or be JSON encodable text. See the tskit documentation for more details on metadata schemas.
- Parameters:
ts (tskit.TreeSequence) – The
tskit.TreeSequence
from which to generate samples.use_sites_time (bool) – If
True
, the times of nodes in the tree sequence are used to set a time for each site (which affects the relative temporal order of ancestors during inference). Times for a site are only used if there is a single mutation at that site, in which case the node immediately below the mutation is taken as the origination time for the variant. IfFalse
, the frequency of the variant is used as a proxy for the relative variant time (seeadd_site()
). Defaults toFalse
.use_individuals_time (bool) – If
True
, use the time of the sample nodes in the tree sequence as the time of the individuals associated with those nodes in the sample data file. This is likely only to be meaningful ifuse_sites_time
is alsoTrue
. IfFalse
, all individuals are set to time 0. Defaults toFalse
.**kwargs – Further arguments passed to the
SampleData
constructor.
- Returns:
A
SampleData
object.- Return type:
- add_population(metadata=None)[source]#
Adds a new Population to this
SampleData
and returns its ID.All calls to this method must be made before individuals or sites are defined.
- add_individual(ploidy=1, metadata=None, population=None, location=None, time=0, flags=None)[source]#
Adds a new Individual to this
SampleData
and returns its ID and those of the resulting additional samples. Adding an individual with ploidyk
results ink
new samples being added, and each of these samples will be associated with the new individual. Each new sample will also be associated with the specified population ID. It is an error to specify a population ID that does not correspond to a population defined usingadd_population()
.All calls to this method must be made after populations are defined using
add_population()
and before sites are defined usingadd_site()
.- Parameters:
ploidy (int) – The ploidy of this individual. This corresponds to the number of samples added that refer to this individual. Defaults to 1 (haploid).
metadata (dict) – A JSON encodable dict-like object containing metadata that is to be associated with this individual.
population (int) – The ID of the population to associate with this individual (or more precisely, with the samples for this individual). If not specified or None, defaults to the null population (-1).
location (arraylike) – An array-like object defining n-dimensional spatial location of this individual. If not specified or None, the empty location is stored.
time (float) – The historical time into the past when the samples associated with this individual were taken. By default we assume that all samples come from the present time (i.e. the default time is 0).
flags (int) – The bitwise flags for this individual.
- Returns:
The ID of the newly added individual and a list of the sample IDs also added.
- Return type:
- add_site(position, genotypes, alleles=None, metadata=None, inference=None, time=None, ancestral_allele=None)[source]#
Adds a new site to this
SampleData
and returns its ID.At a minimum, the new site must specify the
position
andgenotypes
. Sites must be added in increasing order of position; duplicate positions are not supported. For each site a list ofalleles
may be supplied. This list defines the ancestral and derived states at the site. For example, if we setalleles=["A", "T"]
then the ancestral state is “A” and the derived state is “T”. The observed state for each sample is then encoded using thegenotypes
parameter. Thus if we haven
samples then this must be a one dimensional array-like object with lengthn
. Thegenotypes
index into the list ofalleles
, so that for a given arrayg
and sample indexj
,g[j]
should contain0
if samplej
carries the ancestral state at this site and1
if it carries the derived state. For multiple derived states, there may be more than 2alleles`, and ``g[j]
can be greater than1
, but such sites are not used for inference. All sites must have genotypes for the same number of samples.All populations and individuals must be defined before this method is called. If no individuals have been defined using
add_individual()
, the first call to this method addsn
haploid individuals, wheren
is the length of thegenotypes
array.- Parameters:
position (float) – The floating point position of this site. Must be less than the
sequence_length
if provided to theSampleData
constructor. Must be greater than all previously added sites.genotypes (arraylike) – An array-like object defining the sample genotypes at this site. The array of genotypes corresponds to the observed alleles for each sample, represented by indexes into the alleles array. Missing sample data can be represented by tskit.MISSING_DATA in this array. The input is converted to a numpy array with dtype
np.int8
; therefore, for maximum efficiency ensure that the input array is also of this type.alleles (list(str)) – A list of strings defining the alleles at this site. Only biallelic sites can currently be used for inference. Sites with 3 or more non-missing alleles cannot have
inference
(below) set toTrue
. If missing data is present in thegenotypes
array, the stored list of alleles will be modified as necessary so thatalleles[tskit.MISSING_DATA] == None
. Ifalleles
is not specified or None, a default of [“0”, “1”] is used.metadata (dict) – A JSON encodable dict-like object containing metadata that is to be associated with this site.
time (float) – The time of occurence (pastwards) of the mutation to the derived state at this site. If not specified or None, the frequency of the derived alleles (i.e., the proportion of non-zero values in the genotypes, out of all the non-missing values) will be used in inference. For biallelic sites this frequency should provide a reasonable estimate of the relative time, as used to order ancestral haplotypes during the inference process. For sites not used in inference, such as singletons or sites with more than two alleles or when the time is specified as
np.nan
, then the value is unused. Defaults to None.ancestral_allele (int) – A positive index into the alleles array, specifying which allele is the ancestral state, or
tskit.MISSING_DATA
(-1) if the ancestral state is unknown (in which case the site will not be used for inference, and the ancestral state will be inferred using parsimony). Default:None
, treated as0
, so that the first allele in the list is taken as the ancestral state.
- Returns:
The ID of the newly added site.
- Return type:
- finalise()[source]#
Ensures that the state of the data is flushed and writes the provenance for the current operation. The specified ‘command’ is used to fill the corresponding entry in the provenance dictionary.
- merge(other, **kwargs)[source]#
Returns a copy of this SampleData file merged with the specified other SampleData file. Subsequent keyword arguments are passed to the SampleData constructor for the returned merged dataset.
The datasets are merged by following process:
We add the populations from this dataset to the result, followed by the populations from other. Population references from the two datasets are updated accordingly.
We add individual data from this dataset to the result, followed by the individuals from the other dataset.
We merge the variant data from the two datasets by comparing sites by their position. If two sites in the datasets have the same position we combine the genotype data. The alleles from this dataset are updated to include any new alleles in other, and we then combine and update the genotypes accordingly. It is an error if sites with the same position have different ancestral state, time, or metadata values. For sites that exist in one dataset and not the other, we insert the site with
tskit.MISSING_DATA
present in the genotypes for the dataset that does not contain the site.We add the provenances for this dataset, followed by the provenances for the other dataset.
- Parameters:
other (SampleData) – The other
SampleData
instance to to merge.- Returns:
A new SampleData instance which contains the merged data from the two datasets.
- Return type:
- sites(ids=None)[source]#
Returns an iterator over the Site objects. A subset of the sites can be returned using the
ids
parameter. This must be a list of integer site IDs.
- num_alleles(sites=None)[source]#
Returns a numpy array of the number of alleles at each site. Missing data is not counted as an allele.
- Parameters:
sites (array) – A numpy array of sites for which to return data. If None (default) return all sites.
- Returns:
A numpy array of the number of alleles at each site.
- Return type:
numpy.ndarray(dtype=uint32)
- variants(sites=None, recode_ancestral=None)[source]#
Returns an iterator over the
Variant
objects. This is equivalent to thetskit.TreeSequence.variants()
iterator. If recode_ancestral isTrue
, the.alleles
attribute of each variant is guaranteed to return the alleles in an order such that the ancestral state is the first item in the list. In this case,variant.alleles
may list the alleles in a different order from the input order as listed invariant.site.alleles
, and the values in genotypes array will be recoded so that the ancestral state will have a genotype of 0. If the ancestral state is unknown, the original input order is kept.If a variant contains missing data, it is guaranteed that the alleles attribute for that variant satisfies
alleles[tskit.MISSING_DATA] == None
.- Parameters:
sites (array) – A numpy array of ascending site ids for which to return data. If None (default) return all sites.
recode_ancestral (bool) – If True, recode genotypes at sites where the ancestral state is known such that the ancestral state is coded as 0, as described above. Otherwise return genotypes in the input allele encoding. Default:
None
treated asFalse
.
- Returns:
An iterator over the variants in the sample data file.
- Return type:
iter(
Variant
)
- haplotypes(samples=None, sites=None, recode_ancestral=None)[source]#
Returns an iterator over the (sample_id, haplotype) pairs. Each haplotype is an array of indexes, where the
i
th value is an index into the alleles list for thei
th specified site (but see warning below).Warning
If
recode_ancestral=True
, the haplotype values may not correspond to indexes into thesites.alleles
list. Instead, they will correspond to thevariant.alleles
list, returned when iterating overvariants()
usingvariants(recode_ancestral=True)
.- Parameters:
samples (list) – The sample IDs for which haplotypes are returned. If
None
, return haplotypes for all sample nodes, otherwise this may be a numpy array (or array-like) object (converted to dtype=np.int32).sites (array) – A numpy array of sites to use, or
None
for all sites.recode_ancestral (bool) – If
True
, recode genotypes so that the ancestral state is coded as 0 as described undervariants()
. Otherwise return genotypes in the input allele encoding. Default:None
, treated asFalse
.
- Returns:
An iterator over (sample_id, haplotype) pairs.
- Return type:
iter(int, numpy.ndarray(dtype=int8))
- add_provenance(timestamp, record)#
Adds a new provenance record with the specified timestamp and record. Timestamps should ISO8601 formatted, and record is some JSON encodable object.
- arrays()#
Returns a list of all the zarr arrays in this DataContainer.
- clear_provenances()#
Clear all provenances in this instance
- close()#
Close this DataContainer. Any read or write operations attempted after calling this will fail.
- copy(path=None, max_file_size=None)#
Returns a copy of this DataContainer opened in ‘edit’ mode. If path is specified, this must not be equal to the path of the current data container.
- property file_size#
Returns the size of the underlying file, or -1 if we do not have a file associated.
- property info#
Returns a string containing the zarr info for each array.
- provenances()#
Returns an iterator over the (timestamp, record) pairs representing the provenances for this data container.
- record_provenance(command=None, **kwargs)#
Records the provenance information for this file using the tskit provenances schema.
Todo
Add documentation for the data attributes in read-mode.
Define copy mode.
Provide example of updating inference_sites
Ancestor data#
- class tsinfer.AncestorData(position, sequence_length, *, path=None, num_flush_threads=0, compressor=None, chunk_size=1024, max_file_size=None)[source]#
Class representing the stored ancestor data produced by
generate_ancestors()
. See the ancestor data file format specifications for details on the structure of this file.See the documentation for
SampleData
for a discussion of themax_file_size
parameter.- Parameters:
position (arraylike) – Integer array of the site positions of the ancestors. All values should be >0 and the array should be monotonically increasing.
sequence_length (float) – Total length of the sequence, site positions must be less than this value.
path (str) – The path of the file to store the ancestor data. If None, the information is stored in memory and not persistent.
num_flush_threads (int) – The number of background threads to use for compressing data and flushing to disc. If <= 0, do not spawn any threads but use a synchronous algorithm instead. Default=0.
compressor (numcodecs.abc.Codec) – A
numcodecs.abc.Codec
instance to use for compressing data. Any codec may be used, but problems may occur with very large datasets on certain codecs as they cannot compress buffers >2GB. If None, do not use any compression. Default=:class:numcodecs.zstd.Zstd.chunk_size (int) –
The chunk size used for zarr arrays in the sample dimension. This affects compression level and algorithm performance. Default=1024.
chunk_size_sites (int) –
The chunk size used for the genotype zarr arrays in the sites dimension. This affects compression level and algorithm performance. Default=16384.
max_file_size (int) – If a file is being used to store this data, set a maximum size in bytes for the stored file. If None, the default value of 1GiB is used on Windows and 1TiB on other platforms (see above for details).
- data_equal(other)[source]#
Returns True if all the data attributes of this input file and the specified input file are equal. This compares every attribute.
- property sequence_length#
Returns the sequence length.
- property num_sites#
The number of inference sites used to generate the ancestors
- property sites_position#
The positions of the inference sites used to generate the ancestors
- add_provenance(timestamp, record)#
Adds a new provenance record with the specified timestamp and record. Timestamps should ISO8601 formatted, and record is some JSON encodable object.
- arrays()#
Returns a list of all the zarr arrays in this DataContainer.
- clear_provenances()#
Clear all provenances in this instance
- close()#
Close this DataContainer. Any read or write operations attempted after calling this will fail.
- copy(path=None, max_file_size=None)#
Returns a copy of this DataContainer opened in ‘edit’ mode. If path is specified, this must not be equal to the path of the current data container.
- property file_size#
Returns the size of the underlying file, or -1 if we do not have a file associated.
- property info#
Returns a string containing the zarr info for each array.
- provenances()#
Returns an iterator over the (timestamp, record) pairs representing the provenances for this data container.
- record_provenance(command=None, **kwargs)#
Records the provenance information for this file using the tskit provenances schema.
- property ancestors_length#
Returns the length of ancestors in physical coordinates.
- insert_proxy_samples(sample_data, *, sample_ids=None, epsilon=None, allow_mutation=False, require_same_sample_data=True, **kwargs)[source]#
Take a set of samples from a
sample_data
instance and create additional “proxy sample ancestors” from them, returning a newAncestorData
instance including both the current ancestors and the additional ancestors at the appropriate time points.A proxy sample ancestor is an ancestor based upon a known sample. At sites used in the full inference process, the haplotype of this ancestor is identical to that of the sample on which it is based. The time of the ancestor is taken to be a fraction
epsilon
older than the sample on which it is based.A common use of this function is to provide ancestral nodes for anchoring historical samples at the correct time when matching them into a tree sequence during the
tsinfer.match_samples()
stage of inference. For this reason, by default, the samples chosen fromsample_data
are those associated with historical (i.e. non-contemporary) individuals. This can be altered by using thesample_ids
parameter.Note
The proxy sample ancestors inserted here will correspond to extra nodes in the inferred tree sequence. At sites which are not used in the full inference process (e.g. sites unique to a single historical sample), these proxy sample ancestor nodes may have a different genotype from their corresponding sample.
- Parameters:
sample_data (SampleData) – The
SampleData
instance from which to select the samples used to create extra ancestors.sample_ids (list(int)) – A list of sample ids in the
sample_data
instance that will be selected to create the extra ancestors. IfNone
(default) select all the historical samples, i.e. those associated with an Individual whose time is greater than zero. The order of ids is ignored, as are duplicate ids.epsilon (list(float)) – An list of small time increments determining how much older each proxy sample ancestor is than the corresponding sample listed in
sample_ids
. A single value is also allowed, in which case it is used as the time increment for all selected proxy sample ancestors. If None (default) find \({\delta}t\), the smallest time difference between between the sample times and the next oldest ancestor in the currentAncestorData
instance, settingepsilon
= \({\delta}t / 100\) (or, if all selected samples are at least as old as the oldest ancestor, take \({\delta}t\) to be the smallest non-zero time difference between existing ancestors).allow_mutation (bool) – If
False
(the default), any site in a proxy sample ancestor that has a derived allele must have a pre-existing mutation in an older (non-proxy) ancestor, otherwise an error is raised. Alternatively, ifallow_mutation
isTrue
, proxy ancestors can contain a de-novo mutation at a site that also has a mutation elsewhere (i.e. breaking the infinite sites assumption), allowing them to possess derived alleles at sites where there are no pre-existing mutations in older ancestors.require_same_sample_data (bool) – Deprecated Has no effect.
**kwargs – Further arguments passed to the constructor when creating the new
AncestorData
instance which will be returned.
- Returns:
A new
AncestorData
object.- Return type:
- truncate_ancestors(lower_time_bound, upper_time_bound, length_multiplier=2, buffer_length=1000, **kwargs)[source]#
Truncates the length of ancestors above a given time and returns a new
AncestorData
instance.Given a set of haplotypes H such that
lower_time_bound
<=h.time
<upper_time_bound
, we letmax_len = length_multiplier * max(max(h.length) for h in H)
. Then, we truncate all haplotypes containing at least one focal site whereh.time >= upper
, ensuring these haplotypes extend no further than half ofmax_len
to the either side of the leftmost and rightmost focal sites of the ancestral haplotype. Note that ancestors aboveupper_time_bound
may still be longer thanmax_len
if the ancestor contains greater than 2 focal sites.This function should be used when
tsinfer.generate_ancestors()
generates old ancestors which are very long, as these can significantly slow down matching time. Older ancestors should generally be shorter than younger ancestors, so truncating the lengths of older ancestors has negligible effect on inference accuracy.Note
Please ensure that the time values provided to
lower_time_bound
andupper_time_bound
match the units used in theAncestorData
file, i.e. if your ancestors do not have site times specified,upper_time_bound
should be between 0 and 1.- Parameters:
lower_time_bound (float) – Defines the lower bound (inclusive) of the half open interval where we search for a truncation value.
upper_time_bound (float) – Defines the upper bound (exclusive) of the half open interval where we search for a truncation value. The truncation value is the length of the longest haplotype in this interval multiplied by
length_multiplier
. The length of ancestors as old or older thanupper_time_bound
will be truncated using this value.length_multiplier (float) – A multiplier for the length of the longest ancestor in the half-open interval between
lower_time_bound
(inclusive) anduppper_time_bound
(exclusive), i.e. if the longest ancestor in the interval is 1 megabase, alength_multiplier
of 2 creates a maximum length of 2 megabases.buffer_length (int) – The number of changed ancestors to buffer before writing to disk.
**kwargs – Further arguments passed to the
AncestorData.copy()
when creating the newAncestorData
instance which will be returned.
- Returns:
A new
AncestorData
object.- Return type:
- add_ancestor(start, end, time, focal_sites, haplotype)[source]#
Adds an ancestor with the specified haplotype, with ancestral material over the interval [start:end], that is associated with the specified timepoint and has new mutations at the specified list of focal sites. Ancestors should be added in time order, with the oldest first. The id of the added ancestor is returned.
Todo
Add documentation for the data attributes in read-mode.
Running inference#
- tsinfer.infer(sample_data, *, recombination_rate=None, mismatch_ratio=None, path_compression=True, exclude_positions=None, post_process=None, num_threads=0)[source]#
Runs the full inference pipeline on the specified
SampleData
instance and returns the inferredtskit.TreeSequence
. See matching ancestors & samples in the documentation for details ofrecombination_rate
,mismatch_ratio
andpath_compression
.Note
For finer grained control over inference, for example to set mismatch ratios when matching ancestors as well as when matching samples, run
tsinfer.generate_ancestors()
,tsinfer.match_ancestors()
andtsinfer.match_samples()
separately.- Parameters:
sample_data (SampleData) – The input
SampleData
instance representing the observed data that we wish to make inferences from.recombination_rate (float, msprime.RateMap) – Either a floating point value giving a constant rate \(\rho\) per unit length of genome, or an
msprime.RateMap
object. This is used to calculate the probability of recombination between adjacent sites in the match_samples stage. IfNone
, all matching conflicts are resolved by recombination and all inference sites will have a single mutation (equivalent to mismatch_ratio near zero).mismatch_ratio (float) – The probability of a mismatch relative to the median probability of recombination between adjacent sites: can only be used if a recombination rate has been set (default:
None
treated as 1 ifrecombination_rate
is set). This is only applied in the match_samples stage.path_compression (bool) – Whether to merge edges that share identical paths (essentially taking advantage of shared recombination breakpoints).
post_process (bool) – Whether to run the
post_process()
method on the the tree sequence which, among other things, removes ancestral material that does not end up in the current samples (if not specified, defaults toTrue
)exclude_positions (array_like) – A list of site positions to exclude for full inference. Sites with these positions will not be used to generate ancestors, and not used during the copying process. Any such sites that exist in the sample data file will be included in the trees after the main inference process using parsimony. The list does not need to be in to be in any particular order, and can include site positions that are not present in the sample data file.
num_threads (int) – The number of worker threads to use in parallelised sections of the algorithm. If <= 0, do not spawn any threads and use simpler sequential algorithms (default).
simplify (bool) – When post_processing, only simplify the tree sequence. deprecated but retained for backwards compatibility (default:
None
).
- Returns:
The
tskit.TreeSequence
object inferred from the input sample data.- Return type:
- tsinfer.generate_ancestors(sample_data, *, path=None, exclude_positions=None, num_threads=0, genotype_encoding=None, mmap_temp_dir=None, **kwargs)[source]#
Runs the ancestor generation algorithm on the specified
SampleData
instance and returns the resultingAncestorData
instance. If you wish to store generated ancestors persistently on file you must pass thepath
keyword argument to this function. For example,ancestor_data = tsinfer.generate_ancestors(sample_data, path="mydata.ancestors")
Other keyword arguments are passed to the
AncestorData
constructor, which may be used to control the storage properties of the generated file.Ancestor generation involves loading the entire genotype matrix into memory, by default using one byte per haploid genotype, which can be prohibitively large when working with sample sizes of 100,000 or more. There are two options to help mitigate memory usage. The
genotype_encoding
parameter allows the user to specify a more compact encoding scheme, which reduces storage space for datasets with small numbers of alleles. Currently, theGenotypeEncoding.ONE_BIT
encoding is supported, which provides 8-fold compression of biallelic, non-missing data. An error is raised if an encoding that does not support the range of values present in a given dataset is provided.The second option for reducing the RAM footprint of this function is to use the
mmap_temp_dir
parameter. This allows the genotype data to be cached on file, transparently using the operating system’s virtual memory subsystem to swap in and out the data. This can work well if the encoded genotype matrix almost fits in RAM and fast local storage is available. However, if the size of the encoded genotype matrix is more than, say, twice the available RAM it is unlikely that this function will complete in a reasonable time-frame. A temporary file is created in the specifiedmmap_temp_dir
, which is automatically deleted when the function completes.Warning
The
mmap_temp_dir
option is a silent no-op on Windows!- Parameters:
sample_data (SampleData) – The
SampleData
instance that we are genering putative ancestors from.path (str) – The path of the file to store the sample data. If None, the information is stored in memory and not persistent.
exclude_positions (array_like) – A list of site positions to exclude for full inference. Sites with these positions will not be used to generate ancestors, and not used during the copying process. The list does not need be in any particular order.
num_threads (int) – The number of worker threads to use. If < 1, use a simpler synchronous algorithm.
genotype_encoding (int) – The encoding to use for genotype data internally when generating ancestors. See the
GenotypeEncoding
class for the available options. Defaults to one-byte per genotype.mmap_temp_dir (str) – The directory within which to create the temporary backing file when using mmaped memory for bulk genotype storage. If None (the default) allocate memory directly using the standard mechanism. This is an advanced option, usually only relevant when working with very large datasets (see above for more information).
- Returns:
The inferred ancestors stored in an
AncestorData
instance.- Return type:
- class tsinfer.GenotypeEncoding(value)[source]#
The encoding scheme used to store genotypes.
- EIGHT_BIT = 0#
The default approach of using one-byte per genotype. Supports up to 127 alleles and missing data.
- ONE_BIT = 1#
Encode binary genotype data using a single bit.
- tsinfer.match_ancestors(sample_data, ancestor_data, *, recombination_rate=None, mismatch_ratio=None, path_compression=True, num_threads=0)[source]#
Run the ancestor matching algorithm on the specified
SampleData
andAncestorData
instances, returning the resultingtskit.TreeSequence
representing the complete ancestry of the putative ancestors. See matching ancestors & samples in the documentation for details ofrecombination_rate
,mismatch_ratio
andpath_compression
.- Parameters:
sample_data (SampleData) – The
SampleData
instance representing the input data.ancestor_data (AncestorData) – The
AncestorData
instance representing the set of ancestral haplotypes for which we are finding a history.recombination_rate (float, msprime.RateMap) – Either a floating point value giving a constant rate \(\rho\) per unit length of genome, or an
msprime.RateMap
object. This is used to calculate the probability of recombination between adjacent sites. IfNone
, all matching conflicts are resolved by recombination and all inference sites will have a single mutation (equivalent to mismatch_ratio near zero)mismatch_ratio (float) – The probability of a mismatch relative to the median probability of recombination between adjacent sites: can only be used if a recombination rate has been set (default:
None
treated as 1 ifrecombination_rate
is set).path_compression (bool) – Whether to merge edges that share identical paths (essentially taking advantage of shared recombination breakpoints).
num_threads (int) – The number of match worker threads to use. If this is <= 0 then a simpler sequential algorithm is used (default).
- Returns:
The ancestors tree sequence representing the inferred history of the set of ancestors.
- Return type:
- tsinfer.match_samples(sample_data, ancestors_ts, *, recombination_rate=None, mismatch_ratio=None, path_compression=True, post_process=None, indexes=None, force_sample_times=False, num_threads=0)[source]#
Runs the sample matching algorithm on the specified
SampleData
instance and ancestors tree sequence, returning the finaltskit.TreeSequence
instance containing the full inferred history for all samples and sites. See matching ancestors & samples in the documentation for details ofrecombination_rate
,mismatch_ratio
andpath_compression
.- Parameters:
sample_data (SampleData) – The
SampleData
instance representing the input data.ancestors_ts (tskit.TreeSequence) – The
tskit.TreeSequence
instance representing the inferred history among ancestral ancestral haplotypes.recombination_rate (float, msprime.RateMap) – Either a floating point value giving a constant rate \(\rho\) per unit length of genome, or an
msprime.RateMap
object. This is used to calculate the probability of recombination between adjacent sites. IfNone
, all matching conflicts are resolved by recombination and all inference sites will have a single mutation (equivalent to mismatch_ratio near zero)mismatch_ratio (float) – The probability of a mismatch relative to the median probability of recombination between adjacent sites: can only be used if a recombination rate has been set (default:
None
treated as 1 ifrecombination_rate
is set).path_compression (bool) – Whether to merge edges that share identical paths (essentially taking advantage of shared recombination breakpoints).
indexes (array_like) – An array of indexes into the sample_data file of the samples to match (in increasing order) or None for all samples.
post_process (bool) – Whether to run the
post_process()
method on the the tree sequence which, among other things, removes ancestral material that does not end up in the current samples (if not specified, defaults toTrue
)force_sample_times (bool) – After matching, should an attempt be made to adjust the time of “historical samples” (those associated with an individual having a non-zero time) such that the sample nodes in the tree sequence appear at the time of the individual with which they are associated.
num_threads (int) – The number of match worker threads to use. If this is <= 0 then a simpler sequential algorithm is used (default).
simplify (bool) – Treated as an alias for
post_process
, deprecated but currently retained for backwards compatibility if set toFalse
.
- Returns:
The tree sequence representing the inferred history of the sample.
- Return type:
- tsinfer.augment_ancestors(sample_data, ancestors_ts, indexes, *, recombination_rate=None, mismatch_ratio=None, path_compression=True, num_threads=0)[source]#
Runs the sample matching algorithm on the specified
SampleData
instance and ancestors tree sequence, for the specified subset of sample indexes, returning thetskit.TreeSequence
instance including these samples. This tree sequence can then be used as an ancestors tree sequence for subsequent matching against all samples. See matching ancestors & samples in the documentation for details ofrecombination_rate
,mismatch_ratio
andpath_compression
.- Parameters:
sample_data (SampleData) – The
SampleData
instance representing the input data.ancestors_ts (tskit.TreeSequence) – The
tskit.TreeSequence
instance representing the inferred history among ancestral ancestral haplotypes.indexes (array) – The sample indexes to insert into the ancestors tree sequence, in increasing order.
recombination_rate (float, msprime.RateMap) – Either a floating point value giving a constant rate \(\rho\) per unit length of genome, or an
msprime.RateMap
object. This is used to calculate the probability of recombination between adjacent sites. IfNone
, all matching conflicts are resolved by recombination and all inference sites will have a single mutation (equivalent to mismatch_ratio near zero)mismatch_ratio (float) – The probability of a mismatch relative to the median probability of recombination between adjacent sites: can only be used if a recombination rate has been set (default:
None
treated as 1 ifrecombination_rate
is set).path_compression (bool) – Whether to merge edges that share identical paths (essentially taking advantage of shared recombination breakpoints).
num_threads (int) – The number of match worker threads to use. If this is <= 0 then a simpler sequential algorithm is used (default).
- Returns:
The specified ancestors tree sequence augmented with copying paths for the specified sample.
- Return type:
- tsinfer.post_process(ts, *, split_ultimate=None, erase_flanks=None)[source]#
Post-process a tsinferred tree sequence into a more conventional form. This is the function run by default on the final tree sequence output by
match_samples()
. It involves the following 4 steps:If the oldest node is connected to a single child via an edge that spans the entire tree sequence, this oldest node is removed, so that its child becomes the new root (this step is undertaken to remove the “virtual-root-like node” which is added to ancestor tree sequences to enable matching).
If the oldest node is removed in the first step and the new root spans the entire genome, it is treated as the “ultimate ancestor” and (unless
split_ultimate
isFalse
) the node is split into multiple coexisiting nodes with the splits occurring whenever the children of the ultimate ancestor change. The rationale is that tsinfer creates a single ancestral haplotype with all inference sites in the ancestral state: this is, however, unlikely to represent a single ancestor in the past. If the tree sequence is then dated, the fact that ultimate ancestor is split into separate nodes allows these nodes to be dated to different times.Often, extensive regions of genome exist before the first defined site and after the last defined site. Since no data exists in these sections of the genome, post processing by default erases the inferred topology in these regions. However, if
erase_flanks
is False, the flanking regions at the start and end will be assigned the same topology as inferred at the first and last site respectively.The sample nodes are reordered such that they are the first nodes listed in the node table, removing tree nodes and edges that are not on a path between the root and any of the samples (by applying the
simplify()
method withkeep_unary
set to True butfilter_sites
,filter_populations
andfilter_individuals
set to False).
- Parameters:
split_ultimate (bool) – If
True
(default) and the oldest node is the only parent to a single “ultimate ancestor” node, attempt to split this node into many separate nodes (see above). IfFalse
do not attempt to identify or split an ultimate ancestor node.erase_flanks (bool) – If
True
(default), keep only the inferred topology between the first and last sites. IfFalse
, output regions of topology inferred before the first site and after the last site.
- Returns:
The post-processed tree sequence.
- Return type: