Python API

This page provides detailed documentation for the methods and classes available in pyslim. Here is a quick reference to some of the methods:

recapitate(ts[, ancestral_Ne])

Returns a “recapitated” tree sequence, by using msprime to run a coalescent simulation from the “top” of this tree sequence, i.e., allowing any uncoalesced lineages to coalesce.

convert_alleles(ts)

Returns a modified tree sequence in which alleles have been replaced by their corresponding nucleotides.

generate_nucleotides(ts[, …])

Returns a modified tree sequence in which mutations have been randomly assigned nucleotides and (optionally) a reference sequence has been randomly generated.

update_tables(tables)

Update tables produced by a previous verion of SLiM to the current file version.

population_size(ts, x_bins, y_bins, time_bins)

Calculates the population size in each of the spatial bins defined by grid lines at x_bins and y_bins, averaged over each of the time intervals separated by time_bins.

default_slim_metadata(name)

Returns default metadata of type name, where name is one of “tree_sequence”, “edge”, “site”, “mutation”, “mutation_list_entry”, “node”, “individual”, or “population”.

Editing or adding to tree sequences

pyslim provides tools for transforming tree sequences:

pyslim.recapitate(ts, ancestral_Ne=None, **kwargs)[source]

Returns a “recapitated” tree sequence, by using msprime to run a coalescent simulation from the “top” of this tree sequence, i.e., allowing any uncoalesced lineages to coalesce.

To allow recapitation to be done correctly, the nodes of the first generation of the SLiM simulation from whom all samples inherit are still present in the tree sequence, but are not marked as samples. If you simplify the tree sequence before recapitating you must ensure these are not removed, which you do by passing the argument keep_input_roots=True to simplify.

If you specify an ancestral_Ne, then the recapitated portion of the tree sequence will be simulated in a single population with this (diploid) size. In other words, all lineages are moved to a single population of this size (named “ancestral” if this name is not already taken), and coalescence is allowed to happen.

You may control the ancestral demography by passing in a demography argument: see msprime.sim_ancestry().

In general, all defaults are whatever the defaults of {meth}`msprime.sim_ancestry` are; this includes recombination rate, so that if neither recombination_rate or a recombination_map are provided, there will be no recombination.

Parameters
  • ts (TreeSequence) – The tree sequence to transform.

  • ancestral_Ne (float) – If specified, then will simulate from a single ancestral population of this size. It is an error to specify this as well as demography.

  • kwargs (dict) – Any other arguments to msprime.sim_ancestry().

pyslim.convert_alleles(ts)[source]

Returns a modified tree sequence in which alleles have been replaced by their corresponding nucleotides. For sites, SLiM-produced tree sequences have “” (the empty string) for the ancestral state at each site; this method will replace this with the corresponding nucleotide from the reference sequence. For mutations, SLiM records the ‘derived state’ as a SLiM mutation ID; this method will this with the nucleotide from the mutation’s metadata.

This operation is not reversible: since SLiM mutation IDs are lost, the tree sequence will not be able to be read back into SLiM.

The main purpose of this method is for output: for instance, this code will produce a VCF file with nucleotide alleles:

nts = pyslim.convert_alleles(ts)
with open('nucs.vcf', 'w') as f:
    nts.write_vcf(f)

This method will produce an error if the tree sequence does not have a valid reference sequence or if any mutations do not have nucleotides: to first generate these, see generate_nucleotides().

Parameters

ts (TreeSequence) – The tree sequence to transform.

pyslim.generate_nucleotides(ts, reference_sequence=None, keep=True, seed=None)[source]

Returns a modified tree sequence in which mutations have been randomly assigned nucleotides and (optionally) a reference sequence has been randomly generated.

If reference_sequence is a string of nucleotides (A, C, G, and T) of length equal to the sequence length, this is used for the reference sequence. Otherwise (the default), a reference sequence of independent and uniformly random nucleotides is generated.

SLiM stores the nucleotide as an integer in the mutation metadata, with -1 meaning “not a nucleotide mutation”. This method assigns nucleotides by stepping through each mutation and picking a random nucleotide uniformly out of the three possible nucleotides that differ from the parental state (i.e., the derived state of the parental mutation, or the ancestral state if the mutation has no parent). If keep=True (the default), the mutations that already have a nucleotide (i.e., an integer 0-3 in metadata) will not be modified.

Technical note: in the case of stacked mutations, the SLiM mutation that determines the nucleotide state of the (tskit) mutation is the one with the largest slim_time attribute. This method tries to assign nucleotides so that each mutation differs from the previous state, but this is not always possible in certain unlikely cases.

Parameters
  • ts (TreeSequence) – The tree sequence to transform.

  • reference_sequence (bool) – A reference sequence, or None to randomly generate one.

  • keep (bool) – Whether to leave existing nucleotides in mutations that already have one.

  • seed (int) – The random seed for generating new alleles.

pyslim.update_tables(tables)[source]

Update tables produced by a previous verion of SLiM to the current file version.

Summarizing tree sequences

Additionally, pyslim contains the following methods:

pyslim.population_size(ts, x_bins, y_bins, time_bins, stage='late', remembered_stage=None)[source]

Calculates the population size in each of the spatial bins defined by grid lines at x_bins and y_bins, averaged over each of the time intervals separated by time_bins. To obtain actual (census) sizes, the tree sequence must contain all individuals alive, e.g., from a SLiM simulation with all individuals permanently remembered.

With nx, ny and nt the number of bins in the x, y and time directions (so, nx is one less than the length of x_bins), returns a 3-d array with dimensions (nx, ny, nt). The (i,j,k)``th element of the array is the average number of individuals with ``x coordinate in the half-open interval [x_bins[i], x_bins[i + 1]) and y coordinate in the half-open interval [y_bins[i], y_bins[i + 1]), averaged across all times in the half-open interval [time_bins[k], time_bins[k + 1]).

For integer endpoints of the time bins, this average is equivalent to recording the number of indivduals that are alive at each time in the time interval and have location in the relevant location bin, then taking the mean of these recorded population sizes.

Parameters
  • ts (TreeSequence) – The tree sequence to calculate population size from.

  • x_bins (array) – The x-coordinates of the boundaries of the location bins.

  • y_bins (array) – The y-coordinates of the boundaries of the location bins.

  • time_bins (array) – The endpoints of the time bins.

  • stage (str) – The stage in the SLiM life cycle that the endpoints of the time bins refer to (either “early” or “late”; defaults to “late”).

  • remembered_stage (str) – The stage in the SLiM life cycle during which individuals were Remembered (defaults to the stage the tree sequence was recorded at, stored in metadata).

Additions to the tree sequence

The tskit.TreeSequence class represents a sequence of trees that describes the genealogical history of a population. Here, we describe pyslim’s additions to this class.

The SLiM Tree Sequence

class pyslim.SlimTreeSequence[source]

This is just like a tskit.TreeSequence, with a few more properties and methods, including:

  • slim_generation - the SLiM “generation” at the end of the simulation

  • reference_sequence - if using a nucleotide model, the reference sequence

  • individual_locations - numpy array of individual locations

  • individual_ages - numpy array of individiual ages

  • individual_times - numpy array of how long ago each individual was born

  • individual_populations - numpy array of individual’s populations

All mutable properties of individuals (e.g., age) are as they were last recorded in the tree sequence: either the last time they were Remembered, or at the end of the simulation, if they are still alive then.

However, .individual_populations gives individuals’ birth populations; for their final location use the subpopulation attribute of metadata.

You can create a SlimTreeSequence using one of

Variables

reference_sequence – None, or an string of length equal to the sequence length that gives the entire reference sequence for nucleotide models.

dump(path, **kwargs)[source]

Dumps the tree sequence to the path specified. This is only a wrapper tskit.TreeSequence.dump().

Parameters
  • path (str) – The file path to write the TreeSequence to.

  • kwargs – Additional keyword args to pass to tskit.TreeSequence.dump

first_generation_individuals()[source]

Returns the IDs of the individuals remembered as part of the first SLiM generation, as determined by their flags.

Warning

This method is deprecated, because from SLiM version 3.5 the first generation individuals are no longer marked as such: only tree sequences from older versions of SLiM will have these individuals.

has_individual_parents()[source]

Finds which individuals have both their parent individuals also present in the tree sequence, as far as we can tell. To do this, we return a boolean array with True for those individuals for which:

  • all edges terminating in that individual’s nodes are in individuals,

  • each of the individual’s nodes inherit from a single individual only,

  • those parental individuals were alive when the individual was born,

  • the parental individuals account for two whole genomes.

This returns a boolean array indicating for each individual whether all these are true. Note in particular that individuals with only one recorded parent are not counted as “having parents”.

See individuals_alive_at() for further discussion about how this is determined based on when the individuals were Remembered.

Returns

A boolean array of length equal to targets.

individual(id_)[source]

Returns the individual whose ID is given by id_, as documented in tskit.TreeSequence.individual(), but with additional attributes: time, pedigree_id, age, slim_population, sex, and slim_flags. The time and population properties are extracted from the nodes, and an error will be thrown if the individual’s nodes derive from more than one population or more than one time.

Parameters

id (int) – The ID of the individual (i.e., its index).

individual_ages_at(time, stage='late', remembered_stage='late')[source]

Returns the ages of each individual at the corresponding time ago, which will be nan if the individual is either not born yet or dead. This is computed as the time ago the individual was born (found by the time associated with the the individual’s nodes) minus the time argument; while “death” is inferred from the individual’s age, recorded in metadata. These values are the same as what would be shown in SLiM during the corresponding time step and stage.

Since age increments at the end of each time step, the age is the number of time steps ends the individual has lived through, so if they were born in time step time, then their age will be zero.

In a WF model, this method does not provide any more information than does individuals_alive_at(), but for consistency, non-nan ages will be 0 in “late” and 1 in “early”. See individuals_alive_at() for further discussion.

Parameters
  • time (float) – The reference time ago.

  • stage (str) – The stage in the SLiM life cycle used to determine who is alive (either “early” or “late”; defaults to “late”).

  • remembered_stage (str) – The stage in the SLiM life cycle during which individuals were Remembered.

individual_parents()[source]

Finds all parent-child relationships in the tree sequence (as far as we can tell). The output will be a two-column array with row [i,j] indicating that individual i is a parent of individual j. See has_individual_parents() for exactly which parents are returned.

See individuals_alive_at() for further discussion about how this is determined based on when the individuals were Remembered.

Returns

An array of individual IDs, with row [i, j] if individual i is a parent of individual j.

individuals_alive_at(time, stage='late', remembered_stage=None, population=None, samples_only=False)[source]

Returns an array giving the IDs of all individuals that are known to be alive at the given time ago. This is determined using their birth time ago (given by their time attribute) and, for nonWF models, their age attribute (which is equal to their age at the last time they were Remembered). See also individual_ages_at().

In WF models, birth occurs after “early()”, so that individuals are only alive during “late()” for the time step when they have age zero, while in nonWF models, birth occurs before “early()”, so they are alive for both stages.

In both WF and nonWF models, mortality occurs between “early()” and “late()”, so that individuals are last alive during the “early()” stage of the time step of their final age, and if individuals are alive during “late()” they will also be alive during “early()” of the next time step. This means it is important to know during which stage individuals were Remembered - for instance, if the call to sim.treeSeqRememberIndividuals() was made during “early()” of a given time step, then those individuals might not have survived until “late()” of that time step. Since SLiM does not record the stage at which individuals were Remembered, you can specify this by setting remembered_stages: it should be the stage during which all calls to sim.treeSeqRememberIndividuals(), as well as to sim.treeSeqOutput(), were made.

Note also that in nonWF models, birth occurs before “early()”, so the possible parents in a given time step are those that are alive in “early()” and have age greater than zero, or, equivalently, are alive in “late()” during the previous time step. In WF models, birth occurs after “early()”, so possible parents in a given time step are those that are alive during “early()” of that time step or are alive during “late()” of the previous time step.

Parameters
  • time (float) – The number of time steps ago.

  • stage (str) – The stage in the SLiM life cycle that we are inquiring about (either “early” or “late”; defaults to “late”).

  • remembered_stage (str) – The stage in the SLiM life cycle during which individuals were Remembered (defaults to the stage the tree sequence was recorded at, stored in metadata).

  • population (int) – If given, return only individuals in the population(s) with these population ID(s).

  • samples_only (bool) – Whether to return only individuals who have at least one node marked as samples.

classmethod load(path, legacy_metadata=False)[source]

Load a SlimTreeSequence from a .trees file on disk.

Parameters

path (string) – The path to a .trees file.

Rtype SlimTreeSequence

classmethod load_tables(tables, **kwargs)[source]

Creates the SlimTreeSequence defined by the tables.

Parameters

tables (TableCollection) – A set of tables, as produced by SLiM or by annotate_defaults().

Rtype SlimTreeSequence

mutation(id_)[source]

Returns the mutation whose ID is given by id_, as documented in tskit.TreeSequence.mutation(), but with additional attributes: mutation_type, selection_coeff, population, slim_time, and nucleotide. These are all recorded by SLiM in the metadata.

Parameters

id (int) – The ID of the mutation (i.e., its index).

mutation_at(node, position, time=None)[source]

Finds the mutation present in the genome of node at position, returning -1 if there is no such mutation recorded in the tree sequence. Warning: if node is not actually in the tree sequence (e.g., not ancestral to any samples) at position, then this function will return -1, possibly erroneously. If time is provided, returns the last mutation at position inherited by node that occurred at or before time ago.

Parameters
  • node (int) – The index of a node in the tree sequence.

  • position (float) – A position along the genome.

  • time (int) – The time ago that we want the nucleotide, or None, in which case the time of node is used.

Returns

Index of the mutation in question, or -1 if none.

node(id_)[source]

Returns the node whose ID is given by id_, as documented in tskit.TreeSequence.node(), but with additional attributes: slim_id, is_null, and genome_type. These are all recorded by SLiM in the metadata.

Parameters

id (int) – The ID of the node (i.e., its index).

nucleotide_at(node, position, time=None)[source]

Finds the nucleotide present in the genome of node at position. Warning: if node is not actually in the tree sequence (e.g., not ancestral to any samples) at position, then this function will return the reference sequence nucleotide, possibly erroneously. If time is provided, returns the last nucletide produced by a mutation at position inherited by node that occurred at or before time ago.

Parameters
  • node (int) – The index of a node in the tree sequence.

  • position (float) – A position along the genome.

  • time (int) – The time ago that we want the nucleotide, or None, in which case the time of node is used.

Returns

Index of the nucleotide in NUCLEOTIDES (0=A, 1=C, 2=G, 3=T).

population(id_)[source]

Returns the population whose ID is given by id_, as documented in tskit.TreeSequence.population(), but with additional attributes: slim_id, selfing_fraction, female_cloning_fraction, male_cloning_fraction, sex_ratio, bounds_x0, bounds_x1, bounds_y0, bounds_y1, bounds_z0, bounds_z1, and migration_records. These are all recorded by SLiM in the metadata.

Note that SLiM populations are usually indexed starting from 1, but in tskit from zero, so there may be populations (e.g., with id_=0) that have no metadata and were not created by SLiM.

Parameters

id (int) – The ID of the population (i.e., its index).

recapitate(recombination_rate=None, population_configurations=None, recombination_map=None, **kwargs)[source]

Returns a “recapitated” tree sequence, by using msprime to run a coalescent simulation from the “top” of this tree sequence, i.e., allowing any uncoalesced lineages to coalesce.

Warning

This method is deprecated: please use pyslim.recapitate( ) instead.

To allow recapitation to be done correctly, the nodes of the first generation of the SLiM simulation from whom all samples inherit are still present in the tree sequence, but are not marked as samples. If you simplify the tree sequence before recapitating you must ensure these are not removed, which you do by passing the argument keep_input_roots=True to simplify().

Note that Ne is not set automatically, so defaults to 1.0; you probably want to set it explicitly. Similarly, migration is not set up automatically, so that if there are uncoalesced lineages in more than one population, you will need to pass in a demography that allows coalescence.

In general, all defaults are whatever the defaults of {ref}`msprime.simulate` are; this includes recombination rate, so that if neither recombination_rate or a recombination_map are provided, there will be no recombination.

Parameters
  • recombination_rate (float) – A (constant) recombination rate, in units of crossovers per nucleotide per unit of time.

  • population_configurations (list) – See msprime.simulate() for this argument; if not provided, each population will have zero growth rate and the same effective population size.

  • recombination_map (:class`msprime.RecombinationMap`) – The recombination map, or None, if recombination_rate is specified.

  • kwargs (dict) – Any other arguments to msprime.simulate().

simplify(*args, **kwargs)[source]

This is a wrapper for tskit.TreeSequence.simplify(). The only difference is that this method returns the derived class SlimTreeSequence.

If you have not yet recapitated your SlimTreeSequence, you probably want to pass keep_input_roots=True, so that recapitation is possible in the future.

Rtype SlimTreeSequence

property slim_provenance

Returns model type, slim generation, and remembered node count from the last entry in the provenance table that is tagged with “program”=”SLiM”.

NOTE: you probably want to use the .metadata property instead.

Rtype ProvenanceMetadata

property slim_provenances

Returns model type, slim generation, and remembered node count from all entries in the provenance table that are tagged with “program”=”SLiM”.

Rtype ProvenanceMetadata

slim_time(time, stage='late')[source]

Converts the given “tskit times” (i.e., in units of time before the end of the simulation) to SLiM times (those recorded by SLiM, usually in units of generations since the start of the simulation). Although the latter are always integers, these will not be if the provided times are not integers.

When the tree sequence is written out, SLiM records the value of its current generation, which can be found in the metadata: ts.metadata['SLiM']['generation']. In most cases, the “SLiM time” referred to by a time ago in the tree sequence (i.e., the value that would be reported by sim.generation within SLiM at the point in time thus referenced) can be obtained by subtracting that time ago from ts.metadata['SLiM']['generation']. However, in WF models, birth happens between the “early()” and “late()” stages, so if the tree sequence was written out using sim.treeSeqOutput() during “early()” in a WF model, the tree sequence’s times measure time before the last set of individuals are born, i.e., before SLiM time step ts.metadata['SLiM']['generation'] - 1.

In some situations (e.g., mutations added during early() in WF models) this may not return what you expect. See Converting from SLiM time to tskit time for more discussion.

Parameters
  • time (array) – An array of times to be converted.

  • stage (string) – The stage of the SLiM life cycle that the SLiM time should be computed for.

Metadata

SLiM-specific metadata is made visible to the user by .metadata properties. For instance:

ts.node(4).metadata
{'slim_id': 982740, 'is_null': False, 'genome_type': 0}

shows that the fifth node in the tree sequence was given pedigree ID 982740 by SLiM, is not a null genome, and has genome_type zero, which corresponds to an autosome (see below).

Annotation

These two functions will add default SLiM metadata to a tree sequence (or the underlying tables), which can then be modified and loaded into SLiM.

pyslim.annotate_defaults(ts, **kwargs)[source]

Takes a tree sequence (as produced by msprime, for instance), and adds in the information necessary for SLiM to use it as an initial state, filling in mostly default values. Returns a SlimTreeSequence.

Parameters
  • ts (TreeSequence) – A TreeSequence.

  • model_type (string) – SLiM model type: either “WF” or “nonWF”.

  • slim_generation (int) – What generation number in SLiM correponds to time=0 in the tree sequence.

  • reference_sequence (str) – A reference sequence of length equal to ts.sequence_length.

  • annotate_mutations (bool) – Whether to replace mutation metadata with defaults. (If False, the mutation table is unchanged.)

pyslim.annotate_defaults_tables(tables, model_type, slim_generation, reference_sequence=None, annotate_mutations=True)[source]

Does the work of annotate_defaults(), but modifies the tables in place: so, takes tables as produced by msprime, and makes them look like the tables as output by SLiM. See annotate_defaults() for details.

Provenances

class pyslim.ProvenanceMetadata(model_type, slim_generation, file_version)[source]

DEPRECATED: use the tree sequence metadata (under the ‘SLiM’ key) instead.

pyslim.get_provenance(ts, only_last=True)[source]

Extracts model type and slim generation from either the last entry in the provenance table that is tagged with “program”=”SLiM” (if only_last=True) or a list of all of them (otherwise).

Parameters
  • ts (SlimTreeSequence) – The tree sequence or table collection.

  • only_last (bool) – Whether to return only the last SLiM provenance entry, (otherwise, returns a list of all SLiM entries).

Rtype ProvenanceMetadata

Constants and flags

NUCLEOTIDES == ['A', 'C', 'G', 'T']

Nucleotide states in nucleotide models are encoded as integers (0, 1, 2, 3), so a nucleotide encoded as k refers to nucleotide pyslim.NUCLEOTIDES[k].

These flags are the possible values for node.metadata["genome_type"]:

GENOME_TYPE_AUTOSOME == 0
GENOME_TYPE_X == 1
GENOME_TYPE_Y == 2

These flags are the possible values for individual.metadata["sex"]:

INDIVIDUAL_TYPE_HERMAPHRODITE == -1
INDIVIDUAL_TYPE_FEMALE == 0
INDIVIDUAL_TYPE_MALE == 1

This is a flag used in individual.metadata["flags"]:

INDIVIDUAL_FLAG_MIGRATED == 0x01

Finally, these are used in individual.flags:

INDIVIDUAL_ALIVE == 2**16
INDIVIDUAL_REMEMBERED == 2**17
INDIVIDUAL_RETAINED == 2**18