Skip to main content
Ctrl+K
CSE284 - Personal Genomics - Home CSE284 - Personal Genomics - Home
  • Personal Genomics for Bioinformaticians

Module 1 - Introduction

  • Chapter 1: Introduction to the human genome
    • 1.1 The structure of DNA
    • 1.2 Organization of the human genome
    • 1.3 Genes and the central dogma
    • 1.4 The non-coding genome
    • 1.5 Genetic variation terminology
  • Chapter 2: Overview of technologies for genome analysis
    • 2.1 Genotyping arrays
    • 2.2 Next-generation sequencing
    • 2.3 Long-read technologies
    • 2.4 Summary of genomics technologies
  • Tutorial 1: file formats for describing genetic variation
    • T1.2 VCF files
    • T1.2 Plink files

Module 2 - Ancestry

  • Chapter 3: Introduction to population genetics
    • 3.1 The four forces of evolution
    • 3.2 Types and patterns of mutations
    • 3.3 Genetic drift
    • 3.4 Selection
    • 3.5 Hardy-Weinberg Equilibrium and random mating
    • 3.6 Measuring population differentiation with Fst
    • 3.7 Recombination
    • Activity: Exploring the Wright-Fisher model of genetic drift
  • Chapter 4: Global ancestry
    • 4.1 Global ancestry analysis with PCA
    • 4.2 Global ancestry analysis with ADMIXTURE
  • Chapter 5: Relative finding
    • 5.1 Identity by descent
    • 5.2 Expected IBD of close relatives
    • 5.3 Computing expected IBD - plink method
    • 5.4 Computing expected IBD segment sharing
  • Tutorial 2: Working with 1000 Genomes data and plink
    • T2.1 The 1000 Genomes Project dataset (and wrangling VCF files)
    • T2.2 Some tips and examples for using plink

Module 3 - From genotypes to phenotypes

  • Chapter 7: GWAS for complex traits
    • 7.1 GWAS for quantitative traits
    • 7.2 Exploring GWAS association statistics
    • 7.3 Confounding factors in GWAS
    • 7.4 GWAS for case-control traits
    • 7.5 Linear mixed models
    • 7.6 Power to detect associations
  • Chapter 8: Heritability
    • 8.1 Measuring heritability in related individuals
    • 8.2 Measuring SNP-based heritability using LMMs
  • Chapter 9: Polygenic risk scores
    • 9.1 Clumping + Threshold (C+T) method
    • 9.2 Bayesian methods for polygenic risk scores

Appendix 1 - glossary and notation

  • Chapter A1.1: Unix cheat sheet
  • Repository
  • Open issue
  • .md

8.1 Measuring heritability in related individuals

Contents

  • 8.1.1 Midparent regression
  • 8.1.2 Twin studies
  • 8.1.3 Sibling pairs - Haseman Elston regression
  • 8.1.4 Note on alternative way to do HE

8.1 Measuring heritability in related individuals#

In this section we will explore several ways to measure heritability in pedigrees (related individuals).

8.1.1 Midparent regression#

A classical way to measure heritabilitly of a trait is the midparent method. This method fits the following linear equation:

\[ Y_{child} = \beta Y_{mid} + \epsilon \]

where:

  • \(Y_{child}\) is the phenotype of the child

  • \(Y_{mid}=\frac{Y_{mother}+Y_{father}}{2}\) is the average of the two parent phenotypes (“midparent”)

  • \(\beta\) is the slope of the best fit line

  • \(\epsilon\) is the error term

Note we are ignoring covariates like sex here. If this were height for example, we’d need to adjust the phenotypes for covariates first.

Using this method, the slope, \(\beta\) gives an estimate of the heritability. Intuitively, if the slope is close to 1, the trait is completely heritable, and we’d expect the child’s phenotype to be around the average of the two parent values. If heritability is 0, there is no relationship between the child and parent phenotypes. See the example data below:

midparent

Source: Nature Education - Estimating Trait Heritability

Note, even when the heritability is 1, there will still be some variation of the points along the x=y diagonal. Even in this case the child’s phenotype is not completely deterministic: it will depend which half of each parent’s genome they inherit and so there will be some variance.

To help illustrate this, consider the correlation between \(Y_{child}\) and \(Y_{mid}\):

\[ Cor(Y_{child}, Y_{mid}) = \frac{cov(Y_{child}, Y_{mid})}{\sigma(Y_{child})\sigma(Y_{mid})} \]

where \(\sigma(Y)\) gives the standard deviation. We can assume the variance of \(Y_{child}\) is similar to that of \(Y_{mother}\) and \(Y_{father}\) (all just reflect the variance of the phenotype in the population). \(Var(Y_{mother}=Y_{father}=Y_{child})\). Therefore:

\[ Var(Y_{mid}) = Var(\frac{Y_{mother}+Y_{father}}{2}) = Var(\frac{Y_{mother}}{2}) + Var(\frac{Y_{father}}{2}) = 2Var(\frac{Y_{child}}{2}) \]
\[ Var(Y_{mid}) = \frac{1}{2}Var(Y_{child}) \]

We assumed there is no covariance between \(Y_{mother}\) and \(Y_{father}\) and also used the common relationship \(Var(aX)=a^2Var(X)\). We can also write down the least squares estimate of \(\beta\):

\[ \beta = h^2 = \frac{cov(Y_{child}), Y_{mid}}{Var(Y_{mid})} = \frac{2cov(Y_{child}), Y_{mid}}{Var(Y_{child})} \]

Rearranging the above equation for the correlation and substituting our expressions for \(\beta\) and \(Var(Y_{child})\) gives:

\[ Cor(Y_{child}, Y_{mid}) = \frac{cov(Y_{child}, Y_{mid})}{\sigma(Y_{child})\sigma(Y_{mid})} = \frac{cov(Y_{child}, Y_{mid})}{\sqrt{2}Var(Y_{mid})} = \sqrt{\frac{1}{2}}h^2 \]

And so, even if the heritability is 1, the correlation bewteen the values will be less than 1 (0.71).

8.1.2 Twin studies#

The midparent method isn’t perfect. It will tend to over-estimate heritability, since there is likely a lot of shared environmental factors between parents and children. A second study design we could use to estimate heritability is by looking at identical twins, which attempts to account for this.

Consider that you collected phenotypes for many sets of monozygotic (MZ) twins. We could attribute to the correlation of phenotypes between them as being due to both their shared genomes (100%) and their shared environment:

\[ r_{MZ} = V_G + V_E \]

Now we can also collect data for a set of dizygotic (DZ) twins. DZ twins are together in utero, but genetically are the same as siblings. That is, they are expected to share around half of their genomes (remember, this is an average). However, they are still expected to share a similar amount of their environments as do MZ twins. So we can write:

\[ r_{DZ} = \frac{1}{2}V_G + V_E \]

Now we can rearrange to get:

\[ 2(r_{MZ}-r_{DZ}) = 2V_G+2V_E-V_G-2V_E = V_G \]

this is known as Falconer’s method to estimate heritability from twin studies as:

\[ h^2 = 2(r_{MZ}-r_{DZ}) \]

This does better by accounting for shared environment. However, it also makes the simplifying assumption that DZ twins share half their genomes. In reality, there will be a distribution of sharing. Also, it’s possible that MZ twins actually do share more environmental factors, if they are treated more similarly than DZ twins might be.

8.1.3 Sibling pairs - Haseman Elston regression#

Let’s look at one more method to measure heritability, this time in siblings. As mentioned above, siblings will share on average 50% of their genomes, but there is a distribution: some share more, some share less. Haseman-Elston (HE) regression looks at the values of phenotypes for pairs of siblings. Intuitively, siblings who happen to share more of their genomes IBD should have more similar phenotypes than those who happen to share less. We can use the following regression to data from many pairs of siblings:

\[ (Y_{i1}-Y_{i2})^2 = \alpha + \beta \pi_{i} + \epsilon \]

where \(Y_{i1}\) and \(Y_{i2}\) are the phenotype values for sibling pair \(i\) and \(\pi_{i}\) is the IBD sharing between the sibling pair. This line is expected to have a negative slope: if IBD is large, the phenotype difference is expected to be small, and vice versa. It turns out that the slope of this line gives us yet another way to estimate heritability:

\[ h^2 = -\frac{\beta}{2} \]

This is actually related to Falconer’s formula above. We can write:

\[ E[Corr(Y_{i1}, Y_{i2})] = \frac{Cov(Y_{i1},Y_{i2})}{Var(Y_{i1})Var(Y_{i2})} \]

Let’s assume we have standardized the phenotypes to have mean 0 and variance 1. Then the covariance is the same as the correlation and we can write:

\[ E[Corr(Y_{i1}, Y_{i2})] = E[Y_{i1}Y_{i2}]-E[Y_{i1}]E[Y_{i2}] = E[Y_{i1}Y_{i2}] \]

Now back to the HE-regression equation. We can expand the left-hand side:

\[ E[(Y_{i1}-Y_{i2})^2] = E[Y_{i1}^2] + E[Y_{i2}^2] - 2E[Y_{i1}Y_{i2}] = 2Var(Y_{i1}) - 2E[Y_{i1}Y_{i2}] = 2 - 2E[Y_{i1}Y_{i2}] \]

(since the variance of \(Y_{i1}\) and \(Y_{i2}\) are the same, and \(E[Y^2]=Var(Y)-E[Y]^2=1-0 = 1\)). We can rearrange the HE-regression formula to instead regress on the product of sibling phenotype values:

\[ Y_{i1}Y_{i2} = \alpha - \frac{\beta}{2} \pi_{i} + \epsilon \]

(Note the constant 2 just got absorbed into the \(\alpha\) intercept term and we can ignore it.) Now we can see the parallel to the formula we used for \(r_{DZ}\) above:

\[ r_{DZ} = \frac{1}{2}V_G + V_E \]
\[ E[r_{i}] = E[Y_{i1}Y_{i2}] = \pi_{i} (- \frac{\beta}{2}) + \epsilon \]

Parallels:

  • For twin studies, we measured the correlation \(r_{DZ}\). For HE regression, we measured \((Y_{i1}-Y_{i2})^2\), which we rearranged here to be related to the expected correlation between sibs.

  • For twin studies, we just set the IBD (\(\pi\)) term to 0.5. Here we instead have \(\pi_{i}\) is the actual measured IBD between sib pair \(i\).

  • The \(V_G\) term from the twin equation, which turned out to be equal to the heritability estimate, turned into \(-\frac{\beta}{2}\) here.

  • The \(V_E\) term from the twin equation is just our \(\epsilon\) here.

This matches up with what we stated above, which is that \(-\frac{\beta}{2}\) is another estimate of heritability.

8.1.4 Note on alternative way to do HE#

Note: we could have also just regressed the product of sibling phenotype values in the first place on IBD:

\[ Y_{i1}Y_{i2} = \alpha + \beta\pi_{i} + \epsilon \]

In this case, \(\beta\) directly gives us the estimate of \(h^2\). This alternate formulation appears to be a bit more commonly used than the squared difference of phenotypes way presented above.

previous

Chapter 8: Heritability

next

8.2 Measuring SNP-based heritability using LMMs

Contents
  • 8.1.1 Midparent regression
  • 8.1.2 Twin studies
  • 8.1.3 Sibling pairs - Haseman Elston regression
  • 8.1.4 Note on alternative way to do HE

By Melissa Gymrek

© Copyright 2023.