Skip to main content
Ctrl+K
CSE284 - Personal Genomics - Home CSE284 - Personal Genomics - Home
  • Personal Genomics for Bioinformaticians

Module 1 - Introduction

  • Chapter 1: Introduction to the human genome
    • 1.1 The structure of DNA
    • 1.2 Organization of the human genome
    • 1.3 Genes and the central dogma
    • 1.4 The non-coding genome
    • 1.5 Genetic variation terminology
  • Chapter 2: Overview of technologies for genome analysis
    • 2.1 Genotyping arrays
    • 2.2 Next-generation sequencing
    • 2.3 Long-read technologies
    • 2.4 Summary of genomics technologies
  • Tutorial 1: file formats for describing genetic variation
    • T1.2 VCF files
    • T1.2 Plink files

Module 2 - Ancestry

  • Chapter 3: Introduction to population genetics
    • 3.1 The four forces of evolution
    • 3.2 Types and patterns of mutations
    • 3.3 Genetic drift
    • 3.4 Selection
    • 3.5 Hardy-Weinberg Equilibrium and random mating
    • 3.6 Measuring population differentiation with Fst
    • 3.7 Recombination
    • Activity: Exploring the Wright-Fisher model of genetic drift
  • Chapter 4: Global ancestry
    • 4.1 Global ancestry analysis with PCA
    • 4.2 Global ancestry analysis with ADMIXTURE
  • Chapter 5: Relative finding
    • 5.1 Identity by descent
    • 5.2 Expected IBD of close relatives
    • 5.3 Computing expected IBD - plink method
    • 5.4 Computing expected IBD segment sharing
  • Tutorial 2: Working with 1000 Genomes data and plink
    • T2.1 The 1000 Genomes Project dataset (and wrangling VCF files)
    • T2.2 Some tips and examples for using plink

Module 3 - From genotypes to phenotypes

  • Chapter 7: GWAS for complex traits
    • 7.1 GWAS for quantitative traits
    • 7.2 Exploring GWAS association statistics
    • 7.3 Confounding factors in GWAS
    • 7.4 GWAS for case-control traits
    • 7.5 Linear mixed models
    • 7.6 Power to detect associations
  • Chapter 8: Heritability
    • 8.1 Measuring heritability in related individuals
    • 8.2 Measuring SNP-based heritability using LMMs
  • Chapter 9: Polygenic risk scores
    • 9.1 Clumping + Threshold (C+T) method
    • 9.2 Bayesian methods for polygenic risk scores

Appendix 1 - glossary and notation

  • Chapter A1.1: Unix cheat sheet
  • Repository
  • Open issue
  • .md

T2.2 Some tips and examples for using plink

Contents

  • The anatomy of a plink command
    • Specifying input genotypes
      • Converting between genotype formats
      • Caveats when dealing with VCF files
  • Example plink use case - LD pruning

T2.2 Some tips and examples for using plink#

Plink contains implementation of many different methods and utilities that can be performed on genetic data. There are many different options, which can make it a bit confusing if you’ve never used it before. This brief tutorial goes over some plink basics, and through some example use cases.

Note: when using plink, things will run much faster if you use the high memory instances on datahub.

The anatomy of a plink command#

While there are many different ways to use plink, almost all commands will take the following form:

plink \
  <option to specify input genotypes> \
  --out outprefix \
  [command]
  [other options]

That is, you must specify:

  • The path to the genotype information for your samples. This argument can take multiple possible forms (see below)

  • A prefix to name output files. e.g. --out prefix will name files prefix.log, etc. The exact output files depend on the command used.

  • A command to run. e.g. --pca or --genome. Every run has to have some kind of command, even if it is just a command to output a certain file format, like --make-bed or --recode.

Finally, you may wish to specify other options, e.g. to restrict analysis to a certain chromosome, filter SNPs by MAF, etc.

Below we talk more about these different arguments.

Specifying input genotypes#

For almost any use case, plink will need access to the actual genotypes (e.g. 0s, 1s, and 2s for each variant) for one or more samples. There are multiple ways to provide genotypes as input to plink. These include:

  • --vcf <file.vcf>: Plink can directly read VCF files. They can be either bgzipped or not. This is convenient since many big datasets (e.g. 1000Genomes) are in VCF format already. But there are some caveats (see below).

  • --file <prefix>: The original plink format has several files per dataset: $prefix.ped gives family/sample information and genotype calls, and $prefix.map specifies the locations, alleles, and IDs of the SNPs. --file prefix tells plink to look for prefix.map and prefix.ped.

  • --bfile <prefix>: This is a binary (compressed) version of the original plink format. Plink will look for prefix.bed (compressed genotypes) and prefix.bim (SNP info in a simple text file)

Plink handles many other formats, but all of our data for the course will come in one of the formats above. Read more about plink file formats here: https://www.cog-genomics.org/plink/1.9/formats.

Converting between genotype formats#

You can use plink to convert genotypes from one format to another. e.g.:

VCF=~/public/1000Genomes/ALL.chr21.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz

# Convert VCF to ped/map
# (using only couple of variants to make this smaller)
plink \
 --vcf $VCF \
 --recode \
 --out test \
 --from rs559462325 --to rs565663130

# Convert ped/map to bed/bim
plink \
 --file test \
 --make-bed \
 --out test

# Convert bed/bim to VCF
plink \
 --bfile test \
 --recode vcf bgz \
 --out test

Caveats when dealing with VCF files#

Plink currently loses some information present in VCF files, which can be a common source of errors and confusion. You shouldn’t need to deal with most of these things (except the --double-id issue) in your problem sets but they are good to be aware of:

  • It doesn’t properly deal with multi-allelic variants (https://www.cog-genomics.org/plink/1.9/data#merge3)

  • Plink traditionally included both family and sample IDs for each sample. But VCFs only have sample IDs. To deal with this, you’ll have to use the --double-id option if you want to extract sets of samples. See the 1000 Genomes tutorial from week 1.

  • If you convert from VCF to another plink format, it order alleles based on minor vs. major allele. This can be problematic and confusing, since VCFs have a strict notion of reference allele (what is in the reference genome) and alternate allele. See option --keep-allele-order.

  • Phase data is ignored (and orders might be switched in output files).

Example plink use case - LD pruning#

Some analyses, such as PCA or running ADMIXTURE, assume we have a set of SNPs in linkage equilibrium. In those cases, it is common to “prune” SNPs to remove highly correlated SNPs from analysis. See the example below for steps to do this.

VCF=~/public/1000Genomes/ALL.chr21.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz

# Get a list of independent SNPs
# --indep-pairwise <window size> <step size> <r2 threshold>
# Window size = 50 variants
# Step size = variant count to shift the window at the end of each step
# r2 threshold = 0.1. at each step, pairs of variants in the current window with 
# squared correlation greater than the threshold are noted, and variants are 
# greedily pruned from the window until no such pairs remain
# Outputs prune-example.prune.in (variants to keep) and prune-example.prune.out (variants to prune)
plink --vcf $VCF --indep-pairwise 50 10 0.1 --out prune-example

# Extract a new dataset with only those SNPs
plink --vcf $VCF --extract prune-example.in --out prune-example.pruned --make-bed

previous

T2.1 The 1000 Genomes Project dataset (and wrangling VCF files)

next

Chapter 7: GWAS for complex traits

Contents
  • The anatomy of a plink command
    • Specifying input genotypes
      • Converting between genotype formats
      • Caveats when dealing with VCF files
  • Example plink use case - LD pruning

By Melissa Gymrek

© Copyright 2023.