Historical (Ancient) Samples#
Sometimes you may wish to infer and date a genetic genealogy from data which includes historical samples, whose time is older that the current generation (i.e. sample nodes with times > 0).
The output of tsinfer
is valid regardless
of the inclusion of historical samples, but dating such a tree sequence
is more complicated. This is because the time scale of a tsinferred
tree sequence is uncalibrated, so it is unclear where in time to
place any historical samples.
Note
We are currently working on methods to make dating historical samples much more convenient.
The 2 step approach#
Currently, the best way to date such tree sequences is
to perform a two step process, in which inference and dating is first
performed on the modern samples in order to establish a timescale, followed by
adding the historical samples and re-inferring using the dated timescale as a
basis for site ages in the tsinfer
algorithm. The re-inferred tree sequence
can then be redated. The following recipe shows how this is accomplished
with a few lines of Python. The only requirement is a tsinfer.SampleData file with
modern and historical samples (the latter are specified using the
individuals_time
array in a tsinfer.SampleData file).
Note
While tsinfer
does not make assumptions about the
time of sample nodes, tsdate requires a prior
which is calculated from the conditional coalescent on the assumption that all
samples are at time 0. Although this is not the case if there are historical samples,
testing shows that it is still a close enough approximation for the method to
work well.
import logging
import msprime
import tsinfer
import tsdate
Ne = 1e4
mu = 1e-7 # mutation rate
def make_historical_data(num_modern, num_historical, random_seed, historical_time = 10000):
# Returns two SampleData files, one with only moderns and one with everyone
samples = [
msprime.SampleSet(num_modern),
msprime.SampleSet(num_historical, time=historical_time)
]
ts = msprime.sim_ancestry(
samples=samples,
population_size=Ne,
recombination_rate=1e-7,
sequence_length=1e5,
random_seed=random_seed
)
ts = msprime.sim_mutations(ts, rate=mu, random_seed=random_seed)
## Make the SampleData files directly from the tree sequence rather than going via VCF
all_samples = tsinfer.SampleData.from_tree_sequence(
ts, use_sites_time=True, use_individuals_time=True)
modern_samples = tsinfer.SampleData.from_tree_sequence(
ts.simplify(ts.samples(time=0), filter_sites=False))
# NB: if importing from a VCF you would create the modern_samples by subsetting all_samples:
# modern_samples = all_samples.subset(np.where(all_samples.individuals_time[:] == 0)[0])
return modern_samples, all_samples
def infer_and_date_modern_and_ancient(modern_samples, all_samples):
# Date the moderns
inferred_ts = tsinfer.infer(modern_samples)
tsdate_logger = logging.getLogger('tsdate') # temporarily disable the warning
tsdate_logger.setLevel(logging.CRITICAL)
inferred_ts = tsdate.preprocess_ts(inferred_ts)
tsdate_logger.setLevel(logging.WARNING)
dated_modern_ts = tsdate.date(inferred_ts, mutation_rate=mu)
# Find the dates for each site and use those rather than frequency
sites_time = tsdate.sites_time_from_ts(dated_modern_ts) # Get tsdate site age estimates
dated_samples = tsdate.add_sampledata_times(all_samples, sites_time)
ancestors = tsinfer.generate_ancestors(dated_samples)
ancestors_w_proxy = ancestors.insert_proxy_samples(
dated_samples, allow_mutation=True)
ancestors_ts = tsinfer.match_ancestors(dated_samples, ancestors_w_proxy)
return tsinfer.match_samples(dated_samples, ancestors_ts, force_sample_times=True)
dated_ts = infer_and_date_modern_and_ancient(*make_historical_data(2, 1, random_seed=1234))
In the code above, we simulate a tree sequence with six sample chromosomes, four modern and
two historical. We then infer and date a tree sequence using only the modern
samples. Next, we find derived alleles which are carried by the historical samples and
use the age of the historical samples to constrain the ages of these alleles. Finally,
we reinfer the tree sequence, using the date estimates from tsdate and the historical
constraints rather than the frequency of the alleles to order mutations in tsinfer
.
Historical samples are added to the ancestors tree sequence as
proxy nodes, in addition to being used as samples
.