Genetic value#

This page will describe how to simulate genetic values by using genetic_value() function. Genetic values of individuals are described by \(X\beta\), where \(X\) is the matrix that describes the number of causal alleles in each individual and \(\beta\) is the vector of effect sizes. See Phenotype Model for mathematical details on the quantitative trait model.

Learning Objectives

After this genetic value page, you will be able to:

  • Understand how to generate genetic value in tstrait

  • Understand how to use the user’s defined effect sizes to generate genetic values

Algorithm Overview#

Genetic value is computed in tstrait by using the trait information in the input trait dataframe. It uses a tree traversal algorithm to count the number of causal allele in each individual and adds the corresponding effect size to individual’s genetic value.

Input#

The input of genetic_value() function is the following:

ts

Tree sequence input.

trait_df

Trait dataframe that is described in Effect Size Simulation. There are some requirements for the trait dataframe input, which are described in Dataframe requirements.

The details of the parameters and how they influence the genetic value simulation are described in detail below.

Genetic value computation#

In this example, we will be showing how to compute genetic values by using the simulated trait dataframe in genetic_value().

We will be simulating a sample tree sequence data with 5 individuals by using msprime, and we will simulate a trait dataframe with 3 causal sites. The dataframe output of sim_trait() can be automatically used as an input of genetic_value(), so there is no need to worry about input data requirements.

See also

  • msprime for simulating whole genome in tree sequence data.

  • for simulating trait dataframe in tstrait.

import msprime
import tstrait

ts = msprime.sim_ancestry(
    samples=5,
    recombination_rate=1e-8,
    sequence_length=100_000,
    population_size=10_000,
    random_seed=100,
)
ts = msprime.sim_mutations(ts, rate=1e-8, random_seed=101)
model = tstrait.trait_model(distribution="normal", mean=0, var=1)
trait_df = tstrait.sim_trait(ts, num_causal=3, model=model, random_seed=5)
trait_df
position site_id effect_size causal_allele allele_freq trait_id
0 1901 3 0.420445 G 0.2 0
1 69303 89 1.136047 G 0.6 0
2 82287 107 0.109706 A 0.1 0

Please refer to Effect Size Simulation for details on the trait dataframe output and effect size simulation by using sim_trait().

Next, we will be using the trait dataframe to generate genetic values by using genetic_value().

genetic_df = tstrait.genetic_value(ts, trait_df)
genetic_df
trait_id individual_id genetic_value
0 0 0 2.692538
1 0 1 2.381799
2 0 2 0.420445
3 0 3 1.136047
4 0 4 1.136047

The output of genetic_value() is a pandas.DataFrame object that describes the genetic values of individuals.

The genetic value dataframe includes the following columns:

  • trait_id: Trait ID that will be used in multi-trait simulation.

  • individual_id: Individual ID inside the tree sequence input.

  • genetic_value: Simulated genetic values.

User defined trait dataframe#

We will be demonstrating how to simulate genetic values from a user defined trait dataframe.

Dataframe requirements#

There are some requirements for the trait dataframe input:

Columns#

The following columns must be included in trait_df:

  • site_id: Site IDs that have causal allele.

  • effect_size: Simulated effect size of causal allele.

  • causal_allele: Causal allele.

  • trait_id: Trait ID.

Note

The IDs in the trait_id column must start from 0 and be a consecutive integer. The IDs in the site_id column must be non-decreasing.

If you are simulating single traits, trait_id must be an array of zeros. The method pandas.DataFrame.sort_values() can be used to sort the dataframe based on site_id.

Data requirements#

  • Site IDs in site_id column must be sorted in an ascending order. Please refer to pandas.DataFrame.sort_values() for details on sorting values in a pandas.DataFrame.

  • Trait IDs in trait_id column must start from zero and be consecutive.

We will be demonstrating how to simulate genetic values with 1 causal site.

import pandas as pd

data = {"site_id": [0], "effect_size": [1], "trait_id": [0],
        "causal_allele": ["1"]}
trait_df = pd.DataFrame(data)
trait_df
site_id effect_size trait_id causal_allele
0 0 1 0 1
ts = msprime.sim_ancestry(
    samples=5,
    recombination_rate=1e-8,
    sequence_length=100_000,
    population_size=10_000,
    random_seed=100,
)
ts = msprime.sim_mutations(ts, rate=1e-6, model="binary", random_seed=10)
genetic_df = tstrait.genetic_value(ts, trait_df)
genetic_df
trait_id individual_id genetic_value
0 0 0 0.0
1 0 1 1.0
2 0 2 0.0
3 0 3 0.0
4 0 4 0.0

As seen in the genetic value dataframe, the genetic values of individuals are multiples of 1. This is because 1 is the only effect size in the effect size dataframe, and the genetic_value() function is counting the number of causal alleles of site 0 inside the individual.

Next, we will be inputting an effect size dataframe with multiple causal sites.

data = {"site_id": [0, 2, 4], "effect_size": [1, 10, 100],
        "causal_allele": ["1", "1", "1"], "trait_id": [0, 0, 0]}
trait_df = pd.DataFrame(data)
trait_df = trait_df.sort_values(by=["site_id"])
trait_df
site_id effect_size causal_allele trait_id
0 0 1 1 0
1 2 10 1 0
2 4 100 1 0

The site IDs in "site_id" column in the trait dataframe must be sorted in an ascending order, so it is suggested to use .sort_values(by=['site_id']) code before inputting it inside the genetic_value() function.

genetic_df = tstrait.genetic_value(ts, trait_df)
genetic_df.head()
trait_id individual_id genetic_value
0 0 0 0.0
1 0 1 1.0
2 0 2 10.0
3 0 3 100.0
4 0 4 0.0

Numericalisation of Genotypes#

The genotypes are numericalised as the number of causal alleles in each individual (Please see Phenotype Model for mathematical details on the phenotype model), but it would be possible to change the numericalisation by modifying the genetic value dataframe based on the effect size dataframe. For example, in the diploid setting, if you are interested in simulating phenotypes from the genotype \((aa=-1, Aa=0, AA=1)\), where \(A\) is the causal allele, we can simply subtract the sum of effect sizes from the genetic value.

In the following, we will provide a quick example on how genetic values are being computed based on the mutation information and effect sizes. We will assume that the A allele represents the causal allele in site 1 and the B allele represents the causal allele in site 2. The effect sizes will be encoded as \(\beta_1\) and \(\beta_2\) for sites 1 and 2, respectively. The genotype and genetic value of individuals based on the tstrait’s numericalisation of genotypes \((aa=0, Aa=1, AA=2)\) are shown in the table below:

Individual ID

Site 1

Site 2

Genetic Value

1

Aa

BB

\(\beta_1+2\beta_2\)

2

aa

Bb

\(\beta_2\)

If we modify the numericalisation of genotypes to be \((aa=-1, Aa=0, AA=1)\), we get the following:

Individual ID

Site 1

Site 2

Genetic Value

1

Aa

BB

\(\beta_2\)

2

aa

Bb

\(-\beta_1\)

When we compare these outputs, we see that the genetic value of individuals in the new encoding \((aa=-1, Aa=0, AA=1)\) is obtained by subtracting the sum of effect sizes \((\beta_1+\beta_2)\) from the original genetic value.

This can be done in the following example:

trait_df = tstrait.sim_trait(ts, num_causal=3, model=model, random_seed=5)
genetic_df = tstrait.genetic_value(ts, trait_df)

# The original dataframe
genetic_df.head()
trait_id individual_id genetic_value
0 0 0 0.639858
1 0 1 1.976937
2 0 2 2.802245
3 0 3 3.112984
4 0 4 3.222690
genetic_df["genetic_value"] = genetic_df["genetic_value"] - trait_df["effect_size"].sum()
# New dataframe
genetic_df.head()
trait_id individual_id genetic_value
0 0 0 -1.026340
1 0 1 0.310739
2 0 2 1.136047
3 0 3 1.446785
4 0 4 1.556492

The new genetic value dataframe can be used in sim_env() to simulate phenotypes.

Normalise Genetic Value#

The computed genetic values can be scaled by using the normalise_genetic_value() function. The function will first normalise the genetic value by subtracting the mean of the input genetic value and divide it by the standard devitation of the input genetic value. Afterwards, it scales the normalised genetic value based on the mean and variance input. The output of normalise_genetic_value() is a pandas.DataFrame object with the scaled genetic values. It is suggested to use this function on the genetic value dataframe that is obtained by genetic_value(), and use the normalised genetic value dataframe to simulate phenotypes by using sim_env().

An example usage of this function is shown below:

mean = 0
var = 1
ts = msprime.sim_ancestry(
    samples=10_000,
    recombination_rate=1e-8,
    sequence_length=100_000,
    population_size=10_000,
    random_seed=1000,
)
ts = msprime.sim_mutations(ts, rate=1e-7, random_seed=1001)
trait_df = tstrait.sim_trait(ts, num_causal=1000, model=model, random_seed=500)
genetic_df = tstrait.genetic_value(ts, trait_df)
normalised_df = tstrait.normalise_genetic_value(genetic_df, mean=mean, var=var)
normalised_df.head()
individual_id trait_id genetic_value
0 0 0 0.651249
1 1 0 -0.530608
2 2 0 0.534444
3 3 0 -0.637727
4 4 0 -1.807479

The distribution of the normalised genetic value is shown below, and wee that the mean and variance of the normalised genetic values are 0 and 1.

import matplotlib.pyplot as plt
plt.hist(normalised_df["genetic_value"], bins=40)
plt.title("Normalised Genetic Value")
plt.show()
_images/61acb11ce159a12c0eb51138f193c5eceb420c85b6d60fb7977b4714c1a4e24f.png