T2.1 The 1000 Genomes Project dataset (and wrangling VCF files)

T2.1 The 1000 Genomes Project dataset (and wrangling VCF files)#

Overview#

The 1000 Genomes Project dataset provides a deep catalog of human genetic variation across healthy individuals from diverse populations. Unlike nearly all other large human genetic variation datasets, the 1000 Genomes Project data is made available as a community resource. Access is available freely on the internet to any researcher who wants to use the data.

Our analyses throughout the course will focus on the “phase 3” release of the data, which contains genetic variation (VCF files) for 2,504 samples. See A global reference for human genetic variation for more details on how this dataset was generated and for a description of the main findings.

Note, this data is using the GRCh37 human reference genome build. Newer callsets have been generated using the GRCh38 human reference genome, but we will stick with the original phase 3 data here.

1000 Genomes population codes#

Samples originate from distinct population groups. Populations are broadly grouped by superpopulation, denoted by widely used 3-letter codes (EUR=Europe, SAS=South Asia, EAS=East Asia, AFR=African, AMR=American). Superpopulations can be further divided into populations with their own 3 letter codes. For example “YRI” denotes the Yoruban population from Nigeria, which falls under the “AFR” superpopulation. See a full list of population and superpopulation codes below:

EAS
---
CHB	Han Chinese
JPT	Japanese
CHS	Southern Han Chinese
CDX	Dai Chinese
KHV	Kinh Vietnamese
CHD	Denver Chinese

EUR
---
CEU	CEPH (Utah, NW Europe)
TSI	Tuscan
GBR	British
FIN	Finnish
IBS	Spanish

AFR
---
YRI	Yoruba
LWK	Luhya
GWD	Gambian
MSL	Mende
ESN	Esan
ASW	African-American SW
ACB	African-Caribbean

AMR
---
MXL	Mexican-American
PUR	Puerto Rican
CLM	Colombian
PEL	Peruvian

SAS
---
GIH	Gujarati
PJL	Punjabi
BEB	Bengali 
STU	Sri Lankan
ITU	Indian Telugu

1000 Genomes data - VCFs#

We will work with the phase 3 VCF files. These were obtained from https://www.internationalgenome.org/data. If you are working on the CSE284 datahub, you can find the VCF files (one per chromosome) here:

~/public/1000Genomes/

These were downloaded from the UCSC Genome Browser.

For each chromosome, you’ll see two files, e.g.

ALL.chr2.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz
ALL.chr2.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz.tbi

The first is a bgzipped VCF file. The second is an index file, created by the tabix program. Below are some tips on looking at these files:

You can visualize a zipped file on the command line using zcat (rather than cat). To scroll through the file, e.g.:

zcat ALL.chr15.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz | less -S

Once a VCF file is indexed with tabix, we can quickly extract a particular region of interest. e.g.:

tabix ALL.chr15.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz 15:28341863-28341863 | less -S

Note, you’ll see some 1000 Genomes specific fields in the INFO part of the VCF file. For example, you can see the alternate allele frequency of each variant in the different superpopulations in the fields AFR_AF, EUR_AF, etc.

1000 Genomes data - Populations#

This file contains information about the population label of each sample. It was downloaded directly from the IGSR website. If you are on the CSE284 datahub, you can find this file at:

~/public/1000Genomes/igsr_samples.tsv

Below are some helpful example commands for extracting population sets from this file:

Print a list of samples coming from the CEU population:

cat igsr_samples.tsv | \
  grep "1000 Genomes phase 3 release" | \
  awk -F"\t" '($4=="CEU")' | \
  awk '{print $1}'

Print a list of samples coming from the AFR superpopulation:

cat igsr_samples.tsv | \
  grep "1000 Genomes phase 3 release" | \
  awk -F"\t" '($6=="AFR")' | \
  awk '{print $1}'

Print a list of samples coming from the AFR superpopulation, in the format required for plink (see below):

cat igsr_samples.tsv | \
  grep "1000 Genomes phase 3 release" | \
  awk -F"\t" '($6=="AFR")' | \
  awk '{print $1 "\t" $1}'

Using 1000 Genomes data in plink#

Plink is a handy tool for doing all sorts of things with genetic data. The newer version of plink (v1.9+) supports operations directly on VCF files, as well as conversion of VCF files to other data formats supported by plink. Below shows examples of some plink operations on the 1000 Genomes VCF files.

Extract data from chr15 for the CEU population

# First, make a sample file that can be read by plink
# Historically, plink data includes family IDs and sample IDs for each sample
# We do not have family information, so will use the sample ID for both fields
# See https://www.cog-genomics.org/plink/1.9/filter#indiv
cat igsr_samples.tsv | \
  grep "1000 Genomes phase 3 release" | \
  awk -F"\t" '($4=="CEU")' | \
  awk '{print $1 "\t" $1}' > ceu_samples.txt

# Now, run plink to subset
# Option info:
# --vcf uses a VCF file as input
# --out gives the prefix of output files
# --make-bed specifies to output plink binary format (BED/BIM/FAM)
# --keep specifies to keep only the samples in the provided file
# --double-id means to use the sample ID from the VCF as the family ID 
# See https://www.cog-genomics.org/plink/1.9/input#vcf for more info
VCF=~/public/1000Genomes/ALL.chr15.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz
plink --vcf $VCF \
	--out ceu_chr15 \
	--make-bed \
	--keep ceu_samples.txt \
	--double-id

Compute Hardy-Weinberg Equilibrium statistics for CEU samples SNP rs12913832

# See output file ceu_hardy.hwe
VCF=~/public/1000Genomes/ALL.chr15.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz
plink --vcf $VCF \
   --hardy \
   --keep ceu_samples.txt \
   --double-id \
   --snp rs12913832 \
   --out ceu_hardy

Wrangling 1000 Genomes VCF files with bcftools#

bcftools is another package with very useful VCF wrangling capabilities. You are encouraged to check these out here: https://samtools.github.io/bcftools/bcftools.html

We put some examples here you might find helpful using the bcftools query utility, which can output information from VCF files in customized text formats, and also allows for a rich set of options to filter data by sample, position, VCF fields, etc.

Extract genotypes for a single sample (e.g. NA12878) in tab-delimited format

VCF=~/public/1000Genomes/ALL.chr15.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz
bcftools query -s NA12878 -f '%CHROM\t%POS\t[%GT\n]' $VCF

Count the number of singleton variants in this sample, restricting to bi-allelic sites. A singleton is a variant for which only a single copy of the alternate allele was seen in the entire cohort. The command below uses the -i option of bcftools to restrict to sites where the number of alternate alleles is 1 and the alternate allele count is 1. See more on bcftools expressions.

bcftools query --include "N_ALT=1 && INFO/AC=1" \
  -s NA12878 -f '[%GT\n]' $VCF | \
  grep -v "0|0" | wc -l