T2.1 The 1000 Genomes Project dataset (and wrangling VCF files)#
Overview#
The 1000 Genomes Project dataset provides a deep catalog of human genetic variation across healthy individuals from diverse populations. Unlike nearly all other large human genetic variation datasets, the 1000 Genomes Project data is made available as a community resource. Access is available freely on the internet to any researcher who wants to use the data.
Our analyses throughout the course will focus on the “phase 3” release of the data, which contains genetic variation (VCF files) for 2,504 samples. See A global reference for human genetic variation for more details on how this dataset was generated and for a description of the main findings.
Note, this data is using the GRCh37 human reference genome build. Newer callsets have been generated using the GRCh38 human reference genome, but we will stick with the original phase 3 data here.
1000 Genomes population codes#
Samples originate from distinct population groups. Populations are broadly grouped by superpopulation, denoted by widely used 3-letter codes (EUR=Europe, SAS=South Asia, EAS=East Asia, AFR=African, AMR=American). Superpopulations can be further divided into populations with their own 3 letter codes. For example “YRI” denotes the Yoruban population from Nigeria, which falls under the “AFR” superpopulation. See a full list of population and superpopulation codes below:
EAS
---
CHB Han Chinese
JPT Japanese
CHS Southern Han Chinese
CDX Dai Chinese
KHV Kinh Vietnamese
CHD Denver Chinese
EUR
---
CEU CEPH (Utah, NW Europe)
TSI Tuscan
GBR British
FIN Finnish
IBS Spanish
AFR
---
YRI Yoruba
LWK Luhya
GWD Gambian
MSL Mende
ESN Esan
ASW African-American SW
ACB African-Caribbean
AMR
---
MXL Mexican-American
PUR Puerto Rican
CLM Colombian
PEL Peruvian
SAS
---
GIH Gujarati
PJL Punjabi
BEB Bengali
STU Sri Lankan
ITU Indian Telugu
1000 Genomes data - VCFs#
We will work with the phase 3 VCF files. These were obtained from https://www.internationalgenome.org/data. If you are working on the CSE284 datahub, you can find the VCF files (one per chromosome) here:
~/public/1000Genomes/
These were downloaded from the UCSC Genome Browser.
For each chromosome, you’ll see two files, e.g.
ALL.chr2.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz
ALL.chr2.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz.tbi
The first is a bgzipped VCF file. The second is an index file, created by the tabix
program. Below are some tips on looking at these files:
You can visualize a zipped file on the command line using
zcat
(rather thancat
). To scroll through the file, e.g.:
zcat ALL.chr15.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz | less -S
Once a VCF file is indexed with
tabix
, we can quickly extract a particular region of interest. e.g.:
tabix ALL.chr15.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz 15:28341863-28341863 | less -S
Note, you’ll see some 1000 Genomes specific fields in the INFO part of the VCF file. For example, you can see the alternate allele frequency of each variant in the different superpopulations in the fields AFR_AF
, EUR_AF
, etc.
1000 Genomes data - Populations#
This file contains information about the population label of each sample. It was downloaded directly from the IGSR website. If you are on the CSE284 datahub, you can find this file at:
~/public/1000Genomes/igsr_samples.tsv
Below are some helpful example commands for extracting population sets from this file:
Print a list of samples coming from the CEU population:
cat igsr_samples.tsv | \
grep "1000 Genomes phase 3 release" | \
awk -F"\t" '($4=="CEU")' | \
awk '{print $1}'
Print a list of samples coming from the AFR superpopulation:
cat igsr_samples.tsv | \
grep "1000 Genomes phase 3 release" | \
awk -F"\t" '($6=="AFR")' | \
awk '{print $1}'
Print a list of samples coming from the AFR superpopulation, in the format required for plink (see below):
cat igsr_samples.tsv | \
grep "1000 Genomes phase 3 release" | \
awk -F"\t" '($6=="AFR")' | \
awk '{print $1 "\t" $1}'
Using 1000 Genomes data in plink#
Plink is a handy tool for doing all sorts of things with genetic data. The newer version of plink (v1.9+) supports operations directly on VCF files, as well as conversion of VCF files to other data formats supported by plink. Below shows examples of some plink operations on the 1000 Genomes VCF files.
Extract data from chr15 for the CEU population
# First, make a sample file that can be read by plink
# Historically, plink data includes family IDs and sample IDs for each sample
# We do not have family information, so will use the sample ID for both fields
# See https://www.cog-genomics.org/plink/1.9/filter#indiv
cat igsr_samples.tsv | \
grep "1000 Genomes phase 3 release" | \
awk -F"\t" '($4=="CEU")' | \
awk '{print $1 "\t" $1}' > ceu_samples.txt
# Now, run plink to subset
# Option info:
# --vcf uses a VCF file as input
# --out gives the prefix of output files
# --make-bed specifies to output plink binary format (BED/BIM/FAM)
# --keep specifies to keep only the samples in the provided file
# --double-id means to use the sample ID from the VCF as the family ID
# See https://www.cog-genomics.org/plink/1.9/input#vcf for more info
VCF=~/public/1000Genomes/ALL.chr15.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz
plink --vcf $VCF \
--out ceu_chr15 \
--make-bed \
--keep ceu_samples.txt \
--double-id
Compute Hardy-Weinberg Equilibrium statistics for CEU samples SNP rs12913832
# See output file ceu_hardy.hwe
VCF=~/public/1000Genomes/ALL.chr15.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz
plink --vcf $VCF \
--hardy \
--keep ceu_samples.txt \
--double-id \
--snp rs12913832 \
--out ceu_hardy
Wrangling 1000 Genomes VCF files with bcftools#
bcftools
is another package with very useful VCF wrangling capabilities.
You are encouraged to check these out here: https://samtools.github.io/bcftools/bcftools.html
We put some examples here you might find helpful using the bcftools query
utility, which can output information from VCF files in customized text formats, and also allows for a rich set of options to filter data by sample, position, VCF fields, etc.
Extract genotypes for a single sample (e.g. NA12878) in tab-delimited format
VCF=~/public/1000Genomes/ALL.chr15.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz
bcftools query -s NA12878 -f '%CHROM\t%POS\t[%GT\n]' $VCF
Count the number of singleton variants in this sample, restricting to bi-allelic sites. A singleton is a variant for which only a single copy of the alternate allele was seen in the entire cohort. The command below uses the
-i
option of bcftools to restrict to sites where the number of alternate alleles is 1 and the alternate allele count is 1. See more on bcftools expressions.
bcftools query --include "N_ALT=1 && INFO/AC=1" \
-s NA12878 -f '[%GT\n]' $VCF | \
grep -v "0|0" | wc -l