CSE284 - Personal Genomics

T1.2 Plink files

You will also come across genotypes stored in the file formats used by Plink (https://www.cog-genomics.org/plink/1.9/formats). Plink can perform many different functions, including filtering, association testing, IBD calculation, and more.

Plink text files (fam/ped/map)

The text version of plink files includes:

  • FAM (sample info): A text file with no header line, and one line per sample with the following six fields:

    • Family ID (‘FID’)

    • Within-family ID (‘IID’; cannot be ‘0’)

    • Within-family ID of father (‘0’ if father isn’t in dataset)

    • Within-family ID of mother (‘0’ if mother isn’t in dataset)

    • Sex code (‘1’ = male, ‘2’ = female, ‘0’ = unknown)

    • Phenotype value (‘1’ = control, ‘2’ = case, ‘-9’/‘0’/non-numeric = missing data if case/control)

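  • PED (genotypes): A text file with no header line, and one line per sample with 2V+6 fields, where V is the number of variants. The first six fields are the same as those in a .fam file. The remaining 2V fields hold the sample's two allele calls for each variant, in the order the variants appear in the .map file.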

  • MAP (variant info): A text file with no header line, and one line per variant with the following 3-4 fields:

    • Chromosome code. PLINK 1.9 also permits contig names here, but most older programs do not.

    • Variant identifier

    • Position in morgans or centimorgans (optional; also safe to use dummy value of ‘0’)

    • Base-pair coordinate
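
The field layout above can be checked on a tiny example. The sketch below builds a made-up fileset (the file names toy.ped and toy.map, the sample IDs, and the variant IDs are all invented for illustration) and confirms that each .ped line has 2V+6 fields, where V is the number of lines in the .map file. It uses only standard Unix tools, not plink itself.

```shell
# Toy .map file: 3 variants, with a dummy genetic position of 0 (made-up values)
cat > toy.map <<'EOF'
16 rs_a 0 1000
16 rs_b 0 2000
16 rs_c 0 3000
EOF

# Toy .ped file: 2 samples; six sample-info fields, then two alleles per variant
# ("0 0" marks a missing genotype for the last variant of sample S2)
cat > toy.ped <<'EOF'
FAM1 S1 0 0 1 2 A A C T G G
FAM1 S2 0 0 2 1 A G C C 0 0
EOF

# V = number of variants = number of lines in the .map file
n_variants=$(wc -l < toy.map | tr -d ' ')
expected_fields=$((2 * n_variants + 6))

# Count .ped lines whose field count is not 2V+6
bad_lines=$(awk -v n="$expected_fields" 'NF != n' toy.ped | wc -l | tr -d ' ')
echo "variants=$n_variants expected_fields=$expected_fields bad_lines=$bad_lines"

# Column 6 is the phenotype: 2 = case, 1 = control
n_cases=$(awk '$6 == 2' toy.ped | wc -l | tr -d ' ')
n_controls=$(awk '$6 == 1' toy.ped | wc -l | tr -d ' ')
echo "cases=$n_cases controls=$n_controls"
```

A non-zero bad_lines count on a real file usually points to a .ped/.map pair that do not belong together or to a truncated line.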

Plink binary files (bed/bim/fam)

Plink files may be compressed into a binary format. The binary fileset consists of a bed file (genotypes), a bim file (variant info), and a fam file (sample info). Bed files are not human readable but can be converted back to ped/map files.
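
To make "not human readable" concrete, here is a minimal sketch of the .bed layout as described in the PLINK 1.9 file format reference: three fixed header bytes (0x6c 0x1b 0x01), followed by one block per variant that packs four genotypes into each byte. The file toy.bed and the sample and variant counts are invented for illustration, and the genotype bytes are simply zero-filled rather than real data.

```shell
n_samples=5
n_variants=3

# Each genotype takes 2 bits, so each variant needs ceil(N/4) bytes
bytes_per_variant=$(( (n_samples + 3) / 4 ))
expected_size=$(( 3 + n_variants * bytes_per_variant ))

# Write the 3 header bytes (octal escapes for 0x6c 0x1b 0x01),
# then zero-filled placeholder genotype bytes
printf '\154\033\001' > toy.bed
head -c $(( n_variants * bytes_per_variant )) /dev/zero >> toy.bed

# Inspect the header bytes as hex and compare the file size to the formula
magic=$(head -c 3 toy.bed | od -An -tx1 | tr -d ' \n')
actual_size=$(wc -c < toy.bed | tr -d ' ')
echo "magic=$magic expected_size=$expected_size actual_size=$actual_size"
```

The same size formula is a quick sanity check on a real fileset: take N from the number of lines in the .fam file and V from the number of lines in the .bim file.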

You can find example plink data in ~/public/ps2/.

When running plink, you will almost always use one of these options:

  • --bfile <prefix>: uses <prefix>.bed, <prefix>.bim, and <prefix>.fam as input

  • --file <prefix>: uses <prefix>.ped and <prefix>.map as input
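
One way to avoid passing the wrong option is to check which files exist for a prefix before calling plink. The helper below is a hypothetical convenience function (plink_input_flag is not part of plink), assuming --bfile expects .bed/.bim/.fam and --file expects .ped/.map, as described in the PLINK 1.9 documentation. The demo uses empty placeholder files only to exercise the logic.

```shell
# Hypothetical helper: print the plink input option matching the files on disk
plink_input_flag() {
  prefix=$1
  if [ -f "$prefix.bed" ] && [ -f "$prefix.bim" ] && [ -f "$prefix.fam" ]; then
    echo "--bfile $prefix"
  elif [ -f "$prefix.ped" ] && [ -f "$prefix.map" ]; then
    echo "--file $prefix"
  else
    echo "no plink fileset found for prefix: $prefix" >&2
    return 1
  fi
}

# Demo with empty placeholder files (not real genotype data)
demo_dir=$(mktemp -d)
touch "$demo_dir/a.bed" "$demo_dir/a.bim" "$demo_dir/a.fam"
touch "$demo_dir/b.ped" "$demo_dir/b.map"

flag_a=$(plink_input_flag "$demo_dir/a")
flag_b=$(plink_input_flag "$demo_dir/b")
echo "$flag_a"
echo "$flag_b"
```

With real data, the result could be spliced into a command such as plink $(plink_input_flag myprefix) --freq --out myprefix, where myprefix is a placeholder.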

You will probably need to convert between VCF and plink formats at some point. Plink can do that:

# VCF->plink
echo rs112607901 > exclude.txt # this ID was duplicated
plink \
  --vcf pset1_1000Genomes_chr16.vcf.gz \
  --recode  \
  --exclude exclude.txt \
  --out pset1_1000Genomes_chr16

# Plink->VCF
# Note, plink may change the allele order if the major allele
# is not the reference. We use the --a2-allele and 
# --real-ref-alleles options below to force it to correctly
# set ref/alt in the output VCF file
zcat pset1_1000Genomes_chr16.vcf.gz | grep -v "^#" | cut -f 1-5 > gtdata_alleles.tab
plink \
  --file pset1_1000Genomes_chr16 \
  --recode vcf bgz \
  --a2-allele gtdata_alleles.tab 4 3 '#' \
  --real-ref-alleles \
  --exclude exclude.txt \
  --out pset1_1000Genomes_chr16_converted
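
After a round trip like this, it is worth confirming that REF alleles were preserved. The sketch below shows one possible check using standard Unix tools. The two toy VCFs and the helper names make_toy_vcf and vcf_alleles are invented for illustration; with real data you would point vcf_alleles at the original and converted files instead.

```shell
# Print CHROM, POS, ID, REF, ALT for every record in a gzipped VCF
vcf_alleles() {
  gzip -dc "$1" | grep -v "^#" | cut -f 1-5
}

# Write a toy gzipped VCF; $1 and $2 are REF and ALT of the second record
make_toy_vcf() {
  {
    printf '##fileformat=VCFv4.2\n'
    printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
    printf '16\t1000\trs_a\tA\tG\t.\t.\t.\n'
    printf '16\t2000\trs_b\t%s\t%s\t.\t.\t.\n' "$1" "$2"
  } | gzip
}

# Toy stand-ins: the "converted" file has REF/ALT swapped for rs_b
make_toy_vcf C T > toy_orig.vcf.gz
make_toy_vcf T C > toy_conv.vcf.gz

vcf_alleles toy_orig.vcf.gz > orig_alleles.tab
vcf_alleles toy_conv.vcf.gz > conv_alleles.tab

# Report variant IDs whose REF (column 4) differs between the two files
swapped=$(awk 'NR == FNR { ref[$3] = $4; next }
               ($3 in ref) && ref[$3] != $4 { print $3 }' \
  orig_alleles.tab conv_alleles.tab)
echo "variants with changed REF: $swapped"
```

An empty result means every shared variant ID kept its REF allele. Matching on variant ID assumes IDs are unique, which is why the duplicated ID was excluded earlier.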

Plink pgen format (pgen/pvar/psam)

Plink2 has introduced a new binary format, pgen, which offers better computational performance, especially on very large biobank-scale datasets. It also has better support for phased, multi-allelic, and dosage data.

To specify pgen input, instead of using --bfile or --file, you can use --pfile <PREFIX>, which looks for files <PREFIX>.pgen, <PREFIX>.pvar, and <PREFIX>.psam.
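
As a rough sketch, a pgen fileset could be created from the VCF used earlier with plink2's --make-pgen option and then read back with --pfile. The example assumes plink2 is installed and that the VCF file name from the conversion example above is present. It builds the command as a string first so it can be inspected, and only runs it when both the tool and the input file are found.

```shell
vcf=pset1_1000Genomes_chr16.vcf.gz
prefix=pset1_1000Genomes_chr16

# Build the conversion command first so it can be inspected before running
pgen_cmd="plink2 --vcf $vcf --make-pgen --out $prefix"
echo "$pgen_cmd"

# Run only when plink2 and the input VCF are actually available
if command -v plink2 >/dev/null 2>&1 && [ -f "$vcf" ]; then
  # Writes $prefix.pgen, $prefix.pvar, and $prefix.psam
  $pgen_cmd
  # Downstream commands can then read the new fileset with --pfile
  plink2 --pfile "$prefix" --freq --out "$prefix"
fi
```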

By Melissa Gymrek

© Copyright 2023.