A combination of R and other programs can be used to plot, filter, select, and analyze genome resequence data (as well as genotyping-by-sequencing, or GBS, data).
First, I show how to read a VCF file, quality filter it, convert a VCF file to a SNP table, how to carry out principal components analysis on SNP data, and how to compute divergence metrics like allele frequency differences and Fst.
I also describe more specialized tasks such as how to extract a subset of samples and SNPs from a VCF file, how to read a genome fasta file into R, and how to convert (lifting over) genome coordinates from one reference genome to another.
Finally, I describe our GATK4 pipeline to call SNPs and generate a
VCF file – how we get from short read fastq
files to VCF files of SNP
data using the GATK4 best practices. This section is less about R.
Command syntax below is that for running software on Digital Research Alliance (a.k.a. Compute Canada) systems, but other systems will be similar. Information on how to log in and run programs on UBC computers is given at the end. My examples ignore paths to files in other folders. Modify as needed.
Most large computer systems will require you to load program modules before you can run them. The code to do so is computer system-specific. Here are example load commands for most of the examples on this page when running them on Compute Canada (Digital Alliance) computers. Other computer systems will be similar. Program versions are indicated after the / and will need to be updated periodically.
# R and Bioconductor
module load gcc/9.3.0 r/4.1.0 r-bundle-bioconductor/3.14
# bcftools (for manipulating VCF files)
module load bcftools/1.13
# beagle (for imputation)
module load beagle/5.4
# bwa (for aligning reads)
module load bwa/0.7.17
# gatk (for SNP calling and filtering VCF files)
module load gatk/4.2.4.0
# htslib (for bgzip and tabix)
module load htslib/1.15.1
# plink2 (for PCA)
module load plink/2.00-10252019-avx2
# samtools (for processing bam files)
module load samtools/1.10
# vcftools (for filtering VCF files)
module load vcftools/0.1.16
I’ll start this R tips page by assuming that you already have a clean VCF file containing called SNPs only. Jump to the section below for details on our SNP-calling pipeline that yields such a VCF file.
Recent versions of GATK4 code missing genotypes as 0/0, i.e., the homozygous reference genotype, instead of the missing symbol, ./. This is not a good idea. We include a step in our pipeline to recode low-quality genotype calls back to ./.
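For example, here is a minimal R sketch of that recoding idea, assuming the VCF was read into an object vcf with the GT and DP genotype fields (the rule used here, DP equal to 0, follows GATK's own advice quoted later on this page; our actual pipeline step may use additional criteria such as genotype quality):
# in R
GT <- geno(vcf)$GT
DP <- geno(vcf)$DP
# Genotypes with zero read depth are really missing; recode them as ./.
GT[is.na(DP) | DP == 0] <- "./."
geno(vcf)$GT <- GT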
VariantAnnotation
This is the most important R package for processing VCF files and SNP
data. It is part of the Bioconductor project and is more versatile than
other packages I’ve tried. Instructions on how to install Bioconductor
packages are here.
For example, install VariantAnnotation
as follows.
Make sure you have the latest version of R installed first. Then check the Bioconductor page for the latest version compatible with your R version. Next, install the installer if you don’t already have it. Then install the package (here, version 3.17).
install.packages("BiocManager")
BiocManager::install(version = "3.17", update = FALSE, ask = FALSE)
Finally, install VariantAnnotation. Other Bioconductor packages can be installed the same way.
BiocManager::install("VariantAnnotation", update = FALSE, ask = FALSE)
This is typically how you’ll read a SNP data file into R.
Start by reading a small VCF file or you will quickly run out of
memory. You’ll need to specify which fields to read. The more fields you
read, the more memory you will require. I assume that your file is
gz-compressed, but R readVcf()
will also work on
uncompressed VCF files.
R will create names for the SNPs.
The R vcf object is named vcf
in this example.
library(VariantAnnotation)
# VCF file name
vcfFileName <- "small.vcf.gz"
# Choose which VCF fields to read (here, the ALT alleles and the genotypes)
param <- ScanVcfParam(fixed = c("ALT"), geno = "GT", info = NA)
# Read VCF file into R
vcf <- readVcf(file = vcfFileName, param = param)
# View a few snps
rowRanges(vcf)
# Peek at the first 5 columns and rows of the SNP genotypes
geno(vcf)$GT[1:5,1:5]
# Extract the sample names
colnames(vcf)
# View the SNP names R has constructed
head(names(vcf))
A VCF file might be huge, but you can select a range of SNPs to read into memory.
In stickleback, an inversion polymorphism on chromosome XXI runs from about position 9914802 to position 11605355 in v5 of the reference genome. To grab the SNPs in this interval from a VCF file, try the following.
library(VariantAnnotation)
vcfFileName <- "global.chrXXI.snp.vcf.gz"
myInterval <- makeGRangesFromDataFrame(data.frame(
seqnames = "chrXXI", start = 9914802, end = 11605355))
myInterval
param <- ScanVcfParam(fixed = "ALT", geno = c("GT"), info = NA,
which = myInterval)
vcf <- readVcf(file = vcfFileName, param = param)
# Confirm that the SNPs in the file are within myInterval.
rowRanges(vcf)
Reading a big VCF file and many of its fields can be done in chunks to save memory. Then combine the chunks to create a single vcf object retained in memory.
library(VariantAnnotation)
# 1. Choose the fields to read
param <- ScanVcfParam(fixed = c("ALT"), geno = c("GT"),
info = c("DP"))
# 2. Decide on a chunk size (number of lines read at a time)
chunksize <- 500000
# 3. Open VCF file for reading in chunks
myVcfFile <- VcfFile("global.chrXXI.snp.vcf.gz",
yieldSize = chunksize)
open(myVcfFile)
# 4. Read VCF file in chunks and keep them in the list object vcfchunks
vcfchunks <- list()
k <- 1
while (nrow( vcfchunks[[k]] <- readVcf(myVcfFile, param = param)) ){
# Print progress
cat("new lines read:", dim(vcfchunks[[k]]), "\n")
k <- k + 1
}
# 5. Combine the chunks into a single vcf object
vcf <- vcfchunks[[1]]
for(j in 2:length(vcfchunks)){
vcf <- rbind(vcf, vcfchunks[[j]])
}
# 6. Close the input file
close(myVcfFile)
rm(vcfchunks)
R can be used to carry out additional filtering steps on a VCF file containing only SNPs. At the end of this section I also show how to filter a VCF file containing only indels.
Read the VCF file into R. Here, I read only the ALT alleles and leave out the genotypes, which are not needed yet. Leaving them out will speed up the tasks.
library(VariantAnnotation)
vcfFileName <- ""global.chrXXI.snp.vcf.gz""
param <- ScanVcfParam(fixed = c("ALT"), geno = NA, info = NA)
vcf <- readVcf(file = vcfFileName, param = param)
If you haven’t already dropped SNPs having more than one ALT allele, you can do it in R with the following commands. Invariant sites are retained.
# Count the number of entries in the ALT field at each snp
# (invariants will have a single entry, though blank)
altAlleles <- CharacterList(alt(vcf))
nAltAlleles <- sapply(altAlleles, length)
# Frequency of ALT allele numbers
table(nAltAlleles)
# Keep only SNPs with a single entry in the ALT column
vcf <- vcf[nAltAlleles == 1]
# Number of SNPs remaining
nrow(vcf)
Most reference genomes have an accompanying file that identifies interspersed repeats and low-complexity DNA sequences, created using RepeatMasker or a similar program. Below I show how to drop these sites from your VCF file.
First, download the RepeatMasker file for your genome. For stickleback reference genome v5, the file is stickleback_v5_repeatMasker.gff.gz, available at the genome site.
Then read and save these masked repeat regions to a local file as follows.
# Read file
x <- read.table(gzfile("stickleback_v5_repeatMasker.gff.gz"),
sep = "\t", fill = TRUE, stringsAsFactors = FALSE)
# Keep the columns indicating start and end positions of repeats
x1 <- x[, c(1,4,5)]
names(x1) <- c("chr","start","end")
# Split into a list object by chromosome
repeatMaskedBases <- split(x1, x1$chr)
length(repeatMaskedBases)
# [1] 24
# Check names
names(repeatMaskedBases)
# Save
save(repeatMaskedBases, file = "stickleback_v5_RepeatMaskedBases.rdd")
Then drop SNPs at masked bases from your vcf object (here named vcf).
# Load saved R object
load("stickleback_v5_RepeatMaskedBases.rdd")
# Pull out repeat regions in the chromosome of interest (here, chrXXI)
repeats <- repeatMaskedBases[["chrXXI"]]
if(length(repeats) > 0){
# Make an object containing the start and end positions of repeats
repeatMaskedRanges <- IRanges(start = repeats$start, end = repeats$end)
# Make a similar object with start and end positions of SNPs
snpRanges <- IRanges(start = start(vcf), end = end(vcf))
# Find which SNPs overlap the repeat regions
z1 <- findOverlaps(query = repeatMaskedRanges, subject = snpRanges)
z2 <- unique(subjectHits(z1))
# Remove them
if(length(z2) > 0) vcf <- vcf[-z2]
}
# Number of SNPs remaining
nrow(vcf)
The following commands drop SNPs very close to indels (within 3 bases). This requires reading indel positions from an indel-only file created using GATK. My example here uses a hard-filtered indel file named “global.chrXXI.indel.vcf.gz”.
# Read the VCF file of indels - only their positions are needed
indelVcfFileName <- "global.chrXXI.indel.vcf.gz"
indelvcf <- readVcf(file = indelVcfFileName, param =
ScanVcfParam(fixed = NA, geno = NA, info = NA))
# Make an object containing the start and end positions of indels
indelRanges <- IRanges(start = start(indelvcf), end = end(indelvcf))
# Make an object with ranges spanning 3 bases to either side of SNPs in the vcf object.
snpRanges <- IRanges(start = start(vcf) - 3, end = end(vcf) + 3)
# Find the overlaps
z1 <- findOverlaps(query = indelRanges, subject = snpRanges)
z2 <- unique(subjectHits(z1))
# Delete SNPs close to indels
if(length(z2) > 0) vcf <- vcf[-z2]
# Number of SNPs remaining
nrow(vcf)
Your VCF file might still contain some unwanted SNPs that could cause issues in downstream analyses. For example, it might contain null ALT alleles. To remove them, add the following steps in your R code.
ALT <- CharacterList(alt(vcf))
has.null.allele <- sapply(ALT, function(x){"" %in% x})
vcf <- vcf[!has.null.allele]
Similarly, your VCF file might still contain SNPs with spanning deletions (indicated by "*"). To remove them, add the following steps in your R code.
ALT <- CharacterList(alt(vcf))
has.span.deletion <- sapply(ALT, function(x){"*" %in% x})
vcf <- vcf[!has.span.deletion]
First, make a file that tabulates all the remaining SNPs in your vcf object – those that survived the above filtering steps.
chrom <- as.character(seqnames(vcf))
pos <- start(vcf)
write.table(paste(chrom, pos, sep = "\t"), file = "global.chrXXI.keep.txt",
row.names = FALSE, col.names = FALSE, quote = FALSE, sep = "\n")
Next, make a new VCF file that keeps only the SNPs that passed the R filters. This is most easily done with vcftools in unix. The commands below use the syntax for running vcftools (and htslib, to compress and index the new VCF file) on Digital Alliance (Compute Canada) computers, but the syntax on your computer system will be similar.
# in unix
vcftools --gzvcf global.chrXXI.snp.vcf.gz \
--positions global.chrXXI.keep.txt --recode --recode-INFO-all \
--out global.chrXXI.keep
bgzip global.chrXXI.keep.recode.vcf
mv global.chrXXI.keep.recode.vcf.gz global.chrXXI.snp.Rfiltered.vcf.gz
tabix -p vcf global.chrXXI.snp.Rfiltered.vcf.gz
A VCF file containing only indels can also be filtered. Here is how to keep only biallelic indels and also drop positions corresponding to masked bases.
library(VariantAnnotation)
# Identify the input file and choose which VCF fields to read
vcfFileName <- "global.chrXXI.indel.vcf.gz"
param <- ScanVcfParam(fixed = c("ALT"), geno = NA, info = NA)
# Read VCF file into R
vcf <- readVcf(file = vcfFileName,
param = param)
# Show the number of indels
nrow(vcf)
# View indels
rowRanges(vcf)
# To locate biallelic indels, count number of entries in the ALT field
# (invariants will have a single entry, though blank)
altAlleles <- CharacterList(alt(vcf))
nAltAlleles <- sapply(altAlleles, length)
# Frequency of ALT allele numbers
table(nAltAlleles)
# Keep only biallelic indels
vcf <- vcf[nAltAlleles == 1]
# Number of indels remaining
nrow(vcf)
# To locate indels at masked sites, load R object repeatMaskedBases
load("stickleback_v5_RepeatMaskedBases.rdd")
# Pull out repeat regions in the chromosome of interest (here, chrXXI)
repeats <- repeatMaskedBases[["chrXXI"]]
if(length(repeats) > 0){
# Make an object containing the start and end positions of repeats
repeatMaskedRanges <- IRanges(start = repeats$start, end = repeats$end)
# Make a similar object with start and end positions of indels
indelRanges <- IRanges(start = start(vcf), end = end(vcf))
# Find which indels overlap the repeat regions
z1 <- findOverlaps(query = repeatMaskedRanges, subject = indelRanges)
z2 <- unique(subjectHits(z1))
# Remove them
if(length(z2) > 0) vcf <- vcf[-z2]
}
# Number of indels remaining
nrow(vcf)
# Tabulate remaining indel positions to a text file
chrom <- as.character(seqnames(vcf))
pos <- start(vcf)
write.table(paste(chrom, pos, sep = "\t"), file = "global.chrXXI.keep.txt",
row.names = FALSE, col.names = FALSE, quote = FALSE, sep = "\n")
Finally, extract and write the list of saved indels to a new VCF file
using vcftools
in unix.
vcftools --gzvcf global.chrXXI.indel.vcf.gz \
--positions global.chrXXI.keep.txt --recode --recode-INFO-all \
--out global.chrXXI.keep
bgzip global.chrXXI.keep.recode.vcf
mv global.chrXXI.keep.recode.vcf.gz global.chrXXI.indel.Rfiltered.vcf.gz
tabix -p vcf global.chrXXI.indel.Rfiltered.vcf.gz
To identify (and then exclude) samples with a high proportion of missing genotypes, use the following commands. In the example below, the results are written to a text file named global.chrXXI.snp.Rfiltered.imiss. The column of interest is F_MISS, the fraction of missing genotypes for each sample. This example computes the fraction for one chromosome only.
vcftools --gzvcf global.chrXXI.snp.Rfiltered.vcf.gz \
--missing-indv --out global.chrXXI.snp.Rfiltered
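Back in R, you can then read the .imiss file and list the samples to exclude. A minimal sketch (the 0.5 cutoff is illustrative, not a recommendation):
# in R
imiss <- read.table("global.chrXXI.snp.Rfiltered.imiss", header = TRUE)
# Samples with more than half of their genotypes missing
dropSamples <- as.character(imiss$INDV[imiss$F_MISS > 0.5])
dropSamples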
The following command calculates the inbreeding coefficient \(F\) for each sample. The results are written to a text file named global.chrXXI.snp.Rfiltered.het. This example computes \(F\) for one chromosome only.
vcftools --gzvcf global.chrXXI.snp.Rfiltered.vcf.gz \
--het --out global.chrXXI.snp.Rfiltered
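The .het file can then be read into R to inspect the distribution of \(F\) across samples. A minimal sketch:
# in R
het <- read.table("global.chrXXI.snp.Rfiltered.het", header = TRUE)
# Column F holds the per-sample inbreeding coefficient
hist(het$F, breaks = 30, col = "firebrick", main = "", xlab = "F")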
To detect cases of high kinship between samples, first use GATK to merge SNP data for individual chromosomes. The following example for stickleback leaves out chrM and the sex chromosomes chrXIX and chrY.
I use PLINK2 for this purpose, but it has a surprise: any chromosome named “chrX” will be treated as the X chromosome, even if it is actually chromosome 10. PLINK2 has no option to turn this off. The only available solution is to rename chrX. I show how to rename a chromosome below.
gatk --java-options "-Xmx4G" GatherVcfs \
-I global1330.chrI.snp.Rfiltered.vcf.gz \
-I global1330.chrII.snp.Rfiltered.vcf.gz \
-I global1330.chrIII.snp.Rfiltered.vcf.gz \
-I global1330.chrIV.snp.Rfiltered.vcf.gz \
-I global1330.chrIX.snp.Rfiltered.vcf.gz \
-I global1330.chrUn.snp.Rfiltered.vcf.gz \
-I global1330.chrV.snp.Rfiltered.vcf.gz \
-I global1330.chrVI.snp.Rfiltered.vcf.gz \
-I global1330.chrVII.snp.Rfiltered.vcf.gz \
-I global1330.chrVIII.snp.Rfiltered.vcf.gz \
-I global1330.chrX.snp.Rfiltered.vcf.gz \
-I global1330.chrXI.snp.Rfiltered.vcf.gz \
-I global1330.chrXII.snp.Rfiltered.vcf.gz \
-I global1330.chrXIII.snp.Rfiltered.vcf.gz \
-I global1330.chrXIV.snp.Rfiltered.vcf.gz \
-I global1330.chrXV.snp.Rfiltered.vcf.gz \
-I global1330.chrXVI.snp.Rfiltered.vcf.gz \
-I global1330.chrXVII.snp.Rfiltered.vcf.gz \
-I global1330.chrXVIII.snp.Rfiltered.vcf.gz \
-I global1330.chrXX.snp.Rfiltered.vcf.gz \
-I global1330.chrXXI.snp.Rfiltered.vcf.gz \
-O global1330.merged.snp.Rfiltered.vcf.gz
gatk IndexFeatureFile -I global1330.merged.snp.Rfiltered.vcf.gz
Then use PLINK2 to calculate KING kinship coefficients. Results in the following example are written to the text file global1330.merged.snp.Rfiltered.kin0.
plink2 --vcf global1330.merged.snp.Rfiltered.vcf.gz --set-all-var-ids @:# \
--max-alleles 2 --maf 0.05 --allow-extra-chr --make-bed \
--out global1330.merged.snp.Rfiltered
plink2 --bfile global1330.merged.snp.Rfiltered --allow-extra-chr \
--make-king-table --out global1330.merged.snp.Rfiltered
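Back in R, you can scan the .kin0 table for pairs with high kinship. A minimal sketch (the 0.25 cutoff is illustrative; note that PLINK2 writes the header line with a leading “#”, so the first column name may import slightly mangled):
# in R
kin <- read.table("global1330.merged.snp.Rfiltered.kin0", header = TRUE,
    sep = "\t", comment.char = "")
# Pairs with kinship near 0.5 are likely duplicates;
# about 0.25 is expected for parent-child or full sibs
kin[kin$KINSHIP > 0.25, ]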
To extract data for a subset of individuals, you can use GATK’s
SelectVariants
and an expression to identify the chosen
sample names. vcftools
also has commands to accomplish this.
The following example code selects all individuals whose sample names include the label “Benthic”, but excluding Benthics also having the label “Emily”, and additionally leaving out two specific Benthic individuals.
Some SNPs might now be invariant within the subset. Including the
optional --remove-unused-alternates
will remove ALT alleles
in such cases and make them easier to detect. Invariant sites will be
indicated by a “.” in the ALT field of the output VCF file.
gatk --java-options "-Xmx2g" SelectVariants \
-V global.chrXXI.snp.filtered.vcf.gz \
-se "Benthic" \
-xl-se "Emily" \
-xl-sn "PaxBenthic2010.PAX-B-113" \
-xl-sn "PaxBenthic2010.PAX-B-114" \
--remove-unused-alternates \
-O subset.chrXXI.snp.filtered.vcf.gz
vcftools
has a variety of tools to subset SNPs. For example, the commands below
keep only those SNPs genotyped in at least 75% of individuals.
vcftools --gzvcf global.chrXXI.snp.vcf.gz \
--max-missing 0.75 --recode --recode-INFO-all \
--out global.chrXXI.snp75
bgzip global.chrXXI.snp75.recode.vcf
mv global.chrXXI.snp75.recode.vcf.gz global.chrXXI.snp75.vcf.gz
tabix -p vcf global.chrXXI.snp75.vcf.gz
The following commands read a VCF file and create a SNP table. This works only on single-nucleotide, biallelic SNPs (otherwise the genotype is set to missing).
Genotypes are coded 0 = missing, 1 = 0/0, 2 = 0/1 or 1/0, and 3 = 1/1. Phased genotypes (e.g., “1|0”) are coded similarly. The rows of the table will have the samples, and the columns will have the SNPs.
A “0” is assigned either if the genotype is missing or if the SNP is an invariant and no ALT allele is present in the sample (ALT allele indicated by “.”).
library(VariantAnnotation)
# Read the VCF file
vcfFileName <- "myVcfFile.snp.gz"
param <- ScanVcfParam(fixed = c("ALT"), geno = "GT", info = NA)
vcf <- readVcf(file = vcfFileName, param = param)
# Convert to a SNP table
snps <- genotypeToSnpMatrix(vcf) # rows = samples, columns = snps
z <- apply(snps$genotypes@.Data, 2, function(x) as.integer(x))
rownames(z) <- colnames(geno(vcf)$GT)
snps <- as.data.frame(z)
This method to write (and read) .csv
files is familiar
but slow.
# Write to a csv file
csvFileName <- sub(".vcf.gz$", ".snptable.csv", vcfFileName)
write.csv(snps, file = csvFileName, row.names = TRUE)
# Read the SNP table from the saved .csv file
# Use check.names = FALSE to prevent R from revising SNP and sample names
snps <- read.csv(file = csvFileName, row.names = 1, check.names = FALSE)
Use fwrite()
and fread()
in the
data.table
package for a much faster write/read.
library(data.table)
# Write to a csv file
csvFileName <- sub(".vcf.gz$", ".snptable.csv", vcfFileName)
fwrite(snps, file = csvFileName, sep = ",",
col.names = TRUE, row.names = TRUE)
# Read the SNP table from the saved .csv file
snps <- as.data.frame(fread(file = csvFileName))
rownames(snps) <- snps[, 1]
snps <- snps[, -1]
GATK can also make a table of SNP genotypes from a VCF file, which we
can convert to a numeric code by counting the number of ALT alleles for
each genotype (0, 1 or 2; NA = missing). Note that this number coding is
different from the one resulting from the above method using
VariantAnnotation
in R.
If a SNP is an invariant in the sample (ALT allele indicated by “.”) then all the non-missing genotypes at that position will be coded as “0” rather than as NA.
Here, the VariantsToTable output table will have the columns CHROM, POS, REF, ALT, and GT. These indicate the chromosome name, SNP position, the nucleotide for the REF allele, the nucleotide for the ALT allele, and then many columns of genotypes, one for each sample.
# in unix
gatk VariantsToTable \
-V myVcfFile.snp.vcf.gz \
-F CHROM -F POS -F REF -F ALT -GF GT \
-O myTable.snp.txt
Read the output into R using fread()
from the
data.table
package (much faster than
read.csv()
).
library(data.table)
library(stringr)
# Using fread() from data.table because it is faster than read.csv()
x <- fread(file = "myTable.snp.txt", sep = "\t")
# Convert missing genotypes to NA
x[x == "./."] <- NA
# Create SNP names similar to those made by VariantAnnotation
snp_name <- paste0(x$CHROM, ":", x$POS, "_", x$REF, "/", x$ALT)
# Recover the sample names
fishid <- sub(".GT$", "", names(x)[5:ncol(x)])
# Code the genotypes as numeric by counting the number of ALT alleles per genotype
snps <- apply(x, 1, function(x){
str_count(x[5:length(x)], coll(x["ALT"]))
})
snps <- as.data.frame(snps)
rownames(snps) <- fishid
names(snps) <- snp_name
# Peek at the results
snps[1:10,1:4]
How to read a fasta genome file into R.
This code is much faster than using seqinr, but it converts any lower-case letters to upper case, which is not always desirable.
library(Biostrings)
genome <- readDNAStringSet("stickleback_v5_assembly.fa.gz", "fasta")
# Chromosome names
names(genome)
# Peek at the contents
genome
# Extract the sequence from chrI positions 1000 to 2500
myseq <- subseq(genome["chrI"], start = 1000, end = 2500)
myseq
# Show only the sequence information
as.character(myseq)
# This genome file has no ambiguous letter codes
head(alphabetFrequency(genome))
BED files record genomic positions of SNPs using 0-based, half-open intervals. This means that the start coordinate of a range is 0-based (the first base is 0) and the end coordinate is not included in the interval.
The Bioconductor package rtracklayer
can read ordinary
BED files having seqnames, start and end positions for genomic ranges,
and other optional columns.
import()
will import the BED file into a Bioconductor
GRanges object and will automatically convert positions to 1-based
closed intervals. The start and end coordinates are both included in the
interval, and the first base is 1. If your BED file has a range chr1
1000 2000, it will be imported into a GRanges object as chr1 1001-2000.
as.data.frame()
will convert the GRanges object to a data
frame.
library(rtracklayer)
x <- import("genoData.bed", format = "BED")
geno <- as.data.frame(x)
PLINK BED files come in a trio of files: .bed
,
.bim
, and .fam
created using the
--make-bed
argument in a PLINK command. To read a PLINK BED
file in R, use the read.plink()
function from the
snpStats
package.
For example, to read a file “global.chrXXI.PLINK.bed” that was output
from PLINK2, use the following code. The result will be a data frame
with the following coding:
0 = Missing data
1 = Homozygous for the reference allele (“0/0”)
2 = Heterozygous (“0/1” or “1/0”)
3 = Homozygous for the alternate allele (“1/1”)
library(snpStats)
bedfile <- "global.chrXXI.PLINK.bed"
bimfile <- "global.chrXXI.PLINK.bim"
famfile <- "global.chrXXI.PLINK.fam"
x <- read.plink(bedfile, bim = bimfile, fam = famfile)
geno <- as.data.frame(x$genotypes@.Data)
Alternatively, use PLINK itself to export a BED file to a VCF file
that can be read into R. The VCF file will need to be indexed with
tabix
.
# in unix
plink2 --bfile global.chrXXI.PLINK --export vcf bgz --out global.chrXXI.PLINK
tabix -p vcf global.chrXXI.PLINK.vcf.gz
SNP nucleotide positions in one version of a reference genome can be converted to corresponding positions for the same SNP in a new reference genome if a liftover file has been created.
For example, SNPs discovered using v1 of the stickleback reference
genome (Jones et al 2012;
Nature) can be converted to their corresponding positions in version
5 of the stickleback reference genome using the liftover file,
v1_withChrUn_to_v5.chain.txt
, available at Mike White’s genome browser
site.
Note: this file does not include liftovers for chrM, which are unchanged. It also does not include chrY, which was not part of the v1 genome.
To demonstrate, I convert the SNPs in the array developed by Jones et al. 2012; Current Biology from v1 coordinates to v5 coordinates.
First, read the v1 to v5 liftover file.
library(rtracklayer)
library(GenomicRanges)
# Read liftover file
v1.to.v5_chain <- import.chain("v1_withChrUn_to_v5.chain.txt")
# Check chromosome names
names(v1.to.v5_chain)
Next, read the SNP positions in genome v1. I used the
.csv
file kingsley_snps.csv
,
which contains information on 1491 stickleback SNPs used in the array by
Jones et al. and obtained from the Supplement of that paper and from
NCBI where the data are deposited.
The commands below put the SNP positions into a genomic ranges object (containing start and end positions of genomic intervals) with the SNP positions according to the v1 reference genome.
# Read the file
kingsley_snps <- read.csv("kingsley_snps.csv")
head(kingsley_snps)
# Number of snps
nrow(kingsley_snps)
# Extract the relevant columns - strand is optional and can be left out
v1snps <- data.frame(seqnames = kingsley_snps$chr,
start = kingsley_snps$POS, end = kingsley_snps$POS,
strand = kingsley_snps$strand)
# Make the genomic ranges object with start and end positions of each snp
v1Coordinates <- makeGRangesFromDataFrame(v1snps)
v1Coordinates
Finally, convert the SNP positions to genome v5 coordinates.
The next command carries out the conversion. Each row of v1Coordinates corresponds to a list element in v5Coordinates. In this example, 19 of 1491 SNPs could not be converted to the new coordinates and 17 SNPs on chrM were ignored.
# Convert from v1 to v5 positions
v5Coordinates <- liftOver(v1Coordinates, v1.to.v5_chain)
To coerce the result to a data frame, use the following (missing values will be dropped). The variable group will correspond to the rows of the original v1 data frame.
v5snps <- as.data.frame(v5Coordinates)
nrow(v5snps)
head(v5snps)
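If you want the v5 positions attached to the original SNP table, use group to index the rows of kingsley_snps (the rows of v1Coordinates are in the same order as the rows of kingsley_snps, as constructed above). A sketch:
# Attach v5 chromosome and position to the original rows
# (SNPs that failed to convert are dropped)
kingsley_v5 <- data.frame(kingsley_snps[v5snps$group, ],
    chr.v5 = as.character(v5snps$seqnames), pos.v5 = v5snps$start)
head(kingsley_v5)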
Do this, for example, before using PLINK2 to avoid issues with chromosome names (PLINK2 will assume that anything named “chrX” is the X chromosome, even if it is just the name you’ve given to the 10th autosome).
The first method uses R. The following code renames the chromosome “chrX” to “chr10” in a VCF object. The code assumes you’ve already read the VCF file into memory in an object named vcf using the readVcf() command from the VariantAnnotation package.
# Grab the current chromosome names to a vector
chrNames <- seqlevels(vcf)
# Change the name of chrX in the vector
chrNames[chrNames == "chrX"] <- "chr10"
# Use the changed vector to rename the chromosome in the VCF object
vcf <- renameSeqlevels(vcf, chrNames)
# Check
seqlevels(vcf)
The second method uses bcftools in unix. First, create a text file newChrNames.txt containing “oldChrName newChrName” pairs: the old and new name for each chromosome separated by white space, with each chromosome on a separate line in the file.
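For example, to rename chrX to chr10, newChrNames.txt would contain the single line:
chrX chr10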
bcftools annotate --rename-chrs newChrNames.txt global1330.merged.snp.vcf.gz \
-Oz -o global1330.renamedChr.snp.vcf.gz
tabix -p vcf global1330.renamedChr.snp.vcf.gz
The following code writes a VCF object named vcf
in R to
a VCF file. The new file will be compressed and indexed.
# Write to a new VCF file
writeVcf(vcf, file = "newfile.vcf", index = TRUE)
# If you want to rename the output to .gz and .tbi:
system(paste("mv", "newfile.vcf.bgz", "newfile.vcf.gz"))
system(paste("mv", "newfile.vcf.bgz.tbi", "newfile.vcf.gz.tbi"))
There are many ways to do PCA in R. However, one needs a method to account for missing genotypes.
One solution to the problem of missing genotypes is to impute missing genotypes, but this isn’t necessary unless there are very many missing values.
PCA in PLINK2 works for autosomes only but is really fast. PCA using the earlier version PLINK v1.9 is described at Ravinet and Meier’s helpful speciation genomics web site.
The method has a surprise: any chromosome named “chrX” will be treated as the X chromosome, even if it is actually chromosome 10. PLINK2 has no option to turn this off. The only available solution is to rename chrX. I show how to rename a chromosome here.
PLINK2 obtains principal components from a variance-standardized relationship matrix (which is like a correlation matrix). It handles missing genotypes by calculating the variance-standardized relationship for each pair of individuals using just the non-missing SNPs they share. This is fine if genotypes are missing at random, but it is difficult to know this.
The calculation is sensitive to minor allele frequency, so a maf (minor allele frequency) cutoff is typically specified.
Here are the commands if running Compute Canada software. The
argument --max-alleles 2
is redundant if you’ve already
filtered out multi-allelic sites.
# Make BED file and other files needed for pca
plink2 --vcf global.chrXXI.snp.Rfiltered.vcf.gz --set-all-var-ids @:# \
--max-alleles 2 --maf 0.01 --allow-extra-chr --make-bed \
--out global.chrXXI.snp.Rfiltered
# Carry out PCA
plink2 --bfile global.chrXXI.snp.Rfiltered --pca biallelic-var-wts --allow-extra-chr \
--out global.chrXXI.snp.Rfiltered
PLINK was designed for analyses of the human genome. The argument
--set-all-var-ids
allows nonstandard names for snps;
--allow-extra-chr
permits non-standard chromosome names;
king-cutoff
sets a cutoff for relatedness measured by
kinship (duplicate samples have kinship 0.5, whereas 0.25 is the
expected value for parent-child or full-sib); --maf
sets a
cutoff for minor allele frequency.
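The PCA commands above don’t use king-cutoff, but here is a sketch of how it could be added to drop one member of each pair of close relatives before the PCA (the 0.177 cutoff is illustrative):
plink2 --bfile global.chrXXI.snp.Rfiltered --allow-extra-chr \
  --king-cutoff 0.177 --make-bed --out global.chrXXI.snp.Rfiltered.unrelated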
Plot the PCA scores in R.
# in R
pca <- read.table("global.chrXXI.snp.Rfiltered.eigenvec",
sep = "\t", header = TRUE, comment.char = "")
plot(PC2 ~ PC1, data = pca)
# Eigenvalues
eigenval <- scan("global.chrXXI.snp.Rfiltered.eigenval")
eigenval
# SNP loadings
pca.loadings <- read.table("global.chrXXI.snp.Rfiltered.eigenvec.var"),
sep = "\t", header = TRUE, comment.char = "")
PLINK2 will prune markers in high linkage disequilibrium with one another. Whether this is desirable depends on your project.
Heed my above warning about PLINK2 treating any chromosome named “chrX” as the X chromosome, even if it is actually chromosome 10. To rename a chromosome see here.
The argument --indep-pairwise performs the linkage pruning. The first number, 50, sets the window size, in numbers of variants (append ‘kb’ to give the size in kilobases instead). The second number, 10, is the window step size. The third number sets the \(r^2\) threshold for the amount of linkage considered acceptable. Here we prune SNPs that have an \(r^2\) greater than 0.3.
# Make BED and other files
plink2 --vcf global.chrXXI.snp.Rfiltered.vcf.gz --set-all-var-ids @:# \
--max-alleles 2 --maf 0.01 --indep-pairwise 50 10 0.3 \
--allow-extra-chr --make-bed --out global.chrXXI.snp.Rfiltered
# PCA
plink2 --bfile global.chrXXI.snp.Rfiltered \
--extract global.chrXXI.snp.Rfiltered.prune.in --pca --allow-extra-chr \
--out global.chrXXI.snp.Rfiltered
Plot the PCA scores in R.
# in R
pca <- read.table("global.chrXXI.snp.Rfiltered.eigenvec",
sep = "\t", header = TRUE, comment.char = "")
plot(PC2 ~ PC1, data = pca)
pcadapt
pcadapt
is an R package that uses PCA to detect local adaptation. You may have
to install it first.
In the first step, use PLINK2 to convert genotype data in a VCF file to the BED file format (here shown without using pruning). Heed my earlier warning that PLINK2 will treat any chromosome named “chrX” as the X chromosome, even if it is actually chromosome 10. To rename a chromosome see here.
# Make BED and other files
plink2 --vcf global.chrXXI.snp.Rfiltered.vcf.gz --set-all-var-ids @:# \
--max-alleles 2 --maf 0.01 --allow-extra-chr --make-bed \
--out global.chrXXI.snp.Rfiltered
Then run pcadapt
in R.
library(pcadapt)
library(VariantAnnotation)
# Grab individual sample names from the PLINK .fam file
x <- read.table("global.chrXXI.snp.Rfiltered.fam")
ID <- x[, 2]
# Optional commands to make grouping variable from ID.
# Here I extract info from names to indicate ecotype (i.e., benthic or limnetic)
# Your command will be different if you are using other labels or population types
ecotype <- rep("limnetic", length(ID))
ecotype[grep("enthic", ID)] <- "benthic"
# Read SNP names from PLINK bim file
snp <- read.table("global.chrXXI.snp.Rfiltered.bim")
names(snp) <- c("chr", "SNP", "cM", "position", "ALT", "REF")
# PCA with 10 components saved
bed <- read.pcadapt("global.chrXXI.snp.Rfiltered.bed", type = "bed")
PCA <- pcadapt(input = bed, K = 10, min.maf = 0.01)
names(PCA)
# [1] "scores" "singular.values" "loadings" "zscores"
# [5] "af" "maf" "chi2.stat" "stat"
# [9] "gif" "pvalues" "pass"
# Obtain PCA loadings (NA means SNP was an invariant or otherwise dropped)
pca.loadings <- as.data.frame(PCA$loadings, row.names = snp$SNP)
names(pca.loadings) = paste0("PC", 1:ncol(PCA$loadings))
# Obtain PCA scores
pca.scores <- as.data.frame(PCA$scores, row.names = ID)
colnames(pca.scores) <- paste0("PC", 1:ncol(pca.scores))
# scree plot of eigenvalues
plot(PCA, option = "screeplot")
# plot of PC scores
plot(PCA, option = "scores", pop = ecotype)
plot(PCA, option = "scores", pop = lake)
plot(PCA, option = "scores", i = 3, j = 4, pop = lake)
This section shows methods to compare populations or species in allele frequencies and other metrics for a fixed block or window of the genome, emphasizing methods available in R.
We begin by specifying the range of bases to be included in the window and then extract the corresponding genotypes from a VCF file.
library(VariantAnnotation)
# Choose which chromosome
chrname <- "chrXXI"
# Choose the SNP vcf file
vcfFileName <- "global.chrXXI.snp.Rfiltered.vcf.gz"
# Choose which populations to compare
# Here I'm choosing two groups from the Little Campbell River
sampleNames <- scanBcfHeader(file = vcfFileName)[[1]]$Sample
keepSamples <- sampleNames[grep("campbell", sampleNames, ignore.case = TRUE)]
head(keepSamples)
# Set start and end base pairs of window (here, a 10K base window)
range <- c(2000001, 2010000)
# Read the vcf block
param <- ScanVcfParam(fixed = NA, geno = c("GT"), info = NA,
samples = keepSamples,
which = GRanges(chrname, IRanges(start = range[1], end = range[2])))
vcf <- readVcf(file = vcfFileName, genome = "stickleback_v5_assembly.fa.gz",
param = param)
# Extract genotypes and transpose
GT <- as.data.frame(t(geno(vcf)$GT))
rm(vcf)
# Make a unique label for the populations
# Here, the first 6 letters of the names are enough
pop <- substr(rownames(GT), 1, 6)
popfreq <- table(pop)
popfreq
# Marine Stream
# 69 9
# Identify SNPs that are invariants in this sample
is.invariant <- sapply(GT, function(x){
x <- gsub("\\||\\/", "", x)
length(unique(x[x != ".."])) == 1
})
table(is.invariant)
# Drop the invariant sites
GT <- GT[, !is.invariant]
# Replace missing genotype symbol with NA
GT[GT == "./."] <- NA
# Keep only SNPs with >= half individuals genotyped in each pop
ngeno <- aggregate(GT, by = list(pop), FUN = function(x){sum(!is.na(x))})[,-1]
test <- apply(ngeno, 2, function(x){ all(x > 0.5 * popfreq)})
GT <- GT[, test]
This code calculates the number and frequency of alleles in each group in the window. This code is for biallelic SNPs only. To begin, run the commands above to obtain pop and GT for a predetermined window of bases, dropping the invariant sites, replacing missing genotype symbols with NA, and dropping SNPs with too few genotyped individuals per population.
library(stringr)
# Calculate number of REF alleles in each genotype, by population
nref <- aggregate(GT, by = list(pop),
FUN = function(GT){sum(str_count(GT, pattern = "0"), na.rm = TRUE)})
rownames(nref) <- nref[,1]
# Calculate number of ALT alleles in each genotype, by population
nalt <- aggregate(GT, by = list(pop),
FUN = function(GT){sum(str_count(GT, pattern = "1"), na.rm = TRUE)})
rownames(nalt) <- nalt[,1]
# Calculate ALT allele frequency, by population
alleleFreq <- nalt[,-1]/(nref[,-1] + nalt[,-1])
# Mean absolute allele frequency difference in window
alleleFreqDiffMean <- mean(unlist((abs(alleleFreq[1,] -
alleleFreq[2,]))), na.rm = TRUE)
alleleFreqDiffMean
The Cluster Separation Score (CSS) was introduced by Jones et al 2012; Nature as a window-based metric of divergence between two populations. It measures the genetic difference between all pairs of individuals as the euclidean distance in two dimensions obtained from an MDS analysis of proportion sequence divergence. The CSS score is the average distance between pairs of individuals belonging to different populations minus the average distance between pairs belonging to the same population.
The following code is a slight variation of the original method in
Jones et al. To begin, run the commands above to obtain pop
and GT
for a predetermined window of bases, dropping the
invariant sites, replacing missing genotype symbols with NA, and
dropping SNPs with too few genotyped individuals per population.
CSS is a measure of divergence between groups and it grows with the size of the window and with increasing number of snps. To make the measurement comparable between different windows, divide CSS by the total number of called bases in each window of predetermined quality (including snps, indels, and invariants).
# Sort the rows of GT
GT <- GT[order(pop), ]
pop <- pop[order(pop)]
poptype <- sort(unique(pop))
if(length(poptype) != 2) stop("CSS only works for two population types")
# Count number of ALT alleles per genotype (snps become rows)
genotypes <- apply(GT, 1, str_count, pattern = "1")
# List all possible pairs of individuals
colpairs <- combn(1:ncol(genotypes), 2)
# List pops of all possible pairs of individuals
gtpairs <- combn(pop, 2, FUN = function(x){paste(x, collapse=",")})
# Identify within and between-population pairs in pairwise combinations
ii <- gtpairs == paste(poptype[1], poptype[1], sep = ",") # cases of "1,1"
jj <- gtpairs == paste(poptype[2], poptype[2], sep = ",") # cases of "2,2"
ij <- gtpairs == paste(poptype[1], poptype[2], sep = ",") # cases of "1,2"
ji <- gtpairs == paste(poptype[2], poptype[1], sep = ",") # cases of "2,1"
m <- sum(pop == poptype[1])
n <- sum(pop == poptype[2])
# Calculate a measure of proportion sequence divergence between
# all pairs of individuals at every snp:
# 0 = same genotype, 1 = different genotype, 0.5 = a shared allele
psd <- apply(colpairs, 2, function(x){
psd <- abs(genotypes[ , x[1]] - genotypes[ , x[2]])/2
})
colnames(psd) <- gtpairs
# What to do with a missing value for psd?
# Here, I replace missing psd with the mean psd of its poptype comparison:
# The mean of psd between individuals belonging to poptype 1, or
# the mean of psd between individuals belonging to poptype 2, or
# the mean of psd between individuals belonging to different poptypes.
# Assign the average psd's to cases of psd = NA
psdUnique <- unique(colnames(psd))
for(p in psdUnique){
doCols <- c(grep(p, colnames(psd)))
rowMean <- apply(psd[ , doCols], 1, mean, na.rm=TRUE)
for(k in doCols){
psd[is.na(psd[ , k]) , k] <- rowMean[is.na(psd[ , k])]
}
}
# Pairwise differences summed across the window
psdSum <- colSums(psd)
# Initialize distance matrix
d <- matrix(0, nrow = length(pop), ncol = length(pop))
# Place pairwise distances into distance matrix
d[lower.tri(d,diag = FALSE)] <- psdSum
d <- t(d)
d[lower.tri(d,diag = FALSE)] <- psdSum
rownames(d) <- pop
colnames(d) <- pop
# Obtain first 2 MDS axes
z <- cmdscale(d, k = 2, add = TRUE)$points
# Matrix of pairwise distance between individuals on MDS axes
euclid <- as.matrix(dist(z, method = "euclidean"))
euclid.lower <- euclid[lower.tri(euclid)]
# Sum of differences within and between groups
sii <- sum(euclid.lower[ii])
sjj <- sum(euclid.lower[jj])
sij <- sum( sum(euclid.lower[ij]), sum(euclid.lower[ji]), na.rm=TRUE )
# CSS score
CSS <- ( sij/(m*n) ) - (1/(m+n))*( sii/((m-1)/2) + sjj/((n-1)/2) )
CSS
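As described above, CSS grows with window size and SNP number, so for comparisons among windows divide by the number of called bases of predetermined quality in the window. The count must come from your own data; the value below is a hypothetical placeholder:
# Standardize CSS by the number of called bases in the window
nCalledBases <- 10000  # hypothetical count; obtain from your own data
CSS.perBase <- CSS / nCalledBases
CSS.perBase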
Here’s how we calculate Fst (method of Weir and Cockerham 1984) using
the hierfstat
package. The advantage of this package is
that it provides all the variance components needed to calculate Fst.
The calculation is not fast, however. The fst_WC84()
function in the assigner
package is said to be much faster.
Using it instead requires additional steps and packages to get the data
in the right format (not shown here).
Fst is a ratio of the variance among groups divided by the total variance. To get Fst for a window, calculate it as the ratio of the sums of the variance components across SNPs (not the average of the Fst’s). Use the same approach to calculate Fst across multiple windows and the whole genome.
To begin, run the commands above to obtain pop
and
GT
for a predetermined window of bases, dropping the
invariant sites, replacing missing genotype symbols with NA, and
dropping SNPs with too few genotyped individuals per population. The
code below assumes biallelic SNPs only.
library(hierfstat)
# Convert genotypes to hierfstat format - code assumes biallelic SNPs only
genotypes <- GT
genotypes[genotypes == "0/0" | genotypes == "0|0"] <- 11
genotypes[genotypes == "0/1" | genotypes == "1/0"] <- 12
genotypes[genotypes == "0|1" | genotypes == "1|0"] <- 12
genotypes[genotypes == "1/1" | genotypes == "1|1"] <- 22
# Convert character data to numeric
# (note: rownames are lost in the process)
genotypes <- data.frame(lapply(genotypes, as.integer))
# Group ID also needs to be numeric
group <- as.integer(as.factor(pop))
# Weir-Cockerham Fst
z <- wc(cbind(group, genotypes))
# Contents of Fst object
names(z)
# [1] "call" "sigma" "sigma.loc" "per.al" "per.loc" "FST" "FIS"
# Window Fst
z
# $FST
# [1] 0.8322499
# $FIS
# [1] 0.1918404
# z$sigma.loc contains the variance components per snp
head(z$sigma.loc)
# lsiga lsigb lsigw
# chrXXI.2000027_G.T -0.0003637520 7.696227e-05 0.01315789
# chrXXI.2000036_C.T -0.0003732924 8.085008e-05 0.01351351
# chrXXI.2000061_G.A -0.0004972829 8.852259e-05 0.01388889
# chrXXI.2000066_A.G 0.0021377646 -5.930962e-04 0.02666667
# chrXXI.2000100_G.A -0.0004333496 8.680556e-05 0.01388889
# chrXXI.2000108_G.T 0.0795661414 1.245831e-02 0.02739726
# Sum the variance components across SNPs in the window
varCompSum <- colSums(z$sigma.loc)
varCompSum
# lsiga lsigb lsigw
# 57.698847 2.231086 9.398820
# Window Fst
varCompSum["lsiga"] / sum(varCompSum)
# [1] 0.8322499
The numerator of Fst (calculated in the previous section) is a measure of absolute divergence between populations, similar to dxy. For a window-based measure of divergence, divide varCompSum["lsiga"] by the total number of called bases in each window of predetermined quality (including SNPs, indels, and invariants).
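Continuing from the hierfstat output above, a sketch of that calculation (again, nCalledBases is a hypothetical placeholder you must supply from your own tally of callable sites):
# Window-based measure of absolute divergence, akin to dxy
nCalledBases <- 10000  # hypothetical count of called bases in the window
varCompSum["lsiga"] / nCalledBases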
Instructions here use Beagle to impute missing genotypes using identity by descent.
To begin, make sure to save only biallelic genotypes to a VCF file in which missing genotypes are coded as “./.” and not “0/0”.
It would be wise to reduce the VCF file to a subset of SNPs successfully genotyped in a high fraction of individuals. The commands below keep SNPs genotyped in at least 75% of individuals.
vcftools --gzvcf global.chrXXI.snp.Rfiltered.vcf.gz \
--max-missing 0.75 --recode --recode-INFO-all \
--out global.chrXXI.snp.Rfiltered75
bgzip global.chrXXI.snp.Rfiltered75.recode.vcf
mv global.chrXXI.snp.Rfiltered75.recode.vcf.gz global.chrXXI.snp.Rfiltered75.vcf.gz
tabix -p vcf global.chrXXI.snp.Rfiltered75.vcf.gz
Now, impute the missing genotypes. The commands below for genotype imputation are those used on a Compute Canada (Digital Alliance) machine. Don’t forget to load the module first. To speed up the calculations I requested 16 CPUs with 2G of memory per CPU (32G in total).
The following command imputes genotypes using the existing genotypes. The output is a VCF file with imputed genotypes.
java -Xmx31g -jar ${EBROOTBEAGLE}/beagle.22Jul22.46e.jar nthreads=16 \
gt=global.chrXXI.snp.Rfiltered75.vcf.gz out=global.chrXXI.snp.Rfiltered75.imputed
gatk IndexFeatureFile -I global.chrXXI.snp.Rfiltered75.imputed.vcf.gz
Beagle will also impute genotypes using the genotype likelihoods
instead. However, you’ll first need to convert any missing likelihoods.
The zcat
sequence of commands below replaces missing
likelihoods with “0,0,0”.
zcat global.chrXXI.snp.Rfiltered75.vcf.gz | sed 's/\.,\.,\./0,0,0/g' | \
bgzip > global.chrXXI.snp.edited75.vcf.gz
gatk IndexFeatureFile -I global.chrXXI.snp.edited75.vcf.gz
java -Xmx31g -jar ${EBROOTBEAGLE}/beagle.22Jul22.46e.jar nthreads=16 \
gt=global.chrXXI.snp.edited75.vcf.gz out=global.chrXXI.snp.Rfiltered75.imputed
gatk IndexFeatureFile -I global.chrXXI.snp.Rfiltered75.imputed.vcf.gz
The rest of this page lists the steps that we use to generate a VCF
file starting with fastq
files and using a pipeline based
on GATK4 best practices.
For stickleback we use the v5 reference genome, a Paxton Lake
benthic, available at Mike White’s genome browser
site. The downloaded file stickleback_v5_assembly.fa.gz
is a gzipped fasta file containing the v5 assembly of the stickleback
genome. It includes sequence for 24 components: 20 autosomes, the X
chromosome (chrXIX), the Y chromosome (chrY), the mitochondrion (chrM),
plus “chrUn”, which is an artificial chromosome consisting of
unassembled contigs chained together (individual contigs are separated
by a large number of NNNNNNNNNs). To peek at the fasta file, use the
unix command
zless stickleback_v5_assembly.fa.gz
The reference genome file needs to be indexed before use. Unzip the fasta file. Here’s the command for stickleback.
gunzip stickleback_v5_assembly.fa.gz
To index on Compute Canada:
# Generate bwa index
bwa index -a bwtsw stickleback_v5_assembly.fa
# Generate .fai index files needed by GATK
samtools faidx stickleback_v5_assembly.fa
# Generate .dict file needed by GATK
gatk --java-options "-Xmx2g" CreateSequenceDictionary \
-R stickleback_v5_assembly.fa
R can be used to examine number and quality of short reads.
Paired short reads for an individual are in two large text files in
.fastq.gz
format. For paired reads, the first file (file
name with “R1” or just “1”) will have the forward reads, and the second
file (file name with “R2” or “2”) will contain reverse complement
reads.
The fastq.gz
files are generally huge, so the R code
below takes a large random sample of reads instead (the
ShortRead
manual says the default number of reads is 1
million, but I don’t think so).
In R:
library(ShortRead)
# choose a fastq file to inspect
myFileName <- dir(getwd(), "fishName.R1.fastq.gz")
# Take large random sample of reads
reads <- qa(myFileName)
# Show total number of reads read
reads[["readCounts"]]
# Show base frequencies
reads[["baseCalls"]]
# Show mean and median base quality
z <- rep(as.integer(rownames(reads[["baseQuality"]])),
reads[["baseQuality"]]$count)
mean(z, na.rm = TRUE)
median(z, na.rm = TRUE)
# Plot read quality by cycle
perCycle <- reads[["perCycle"]]
ShortRead:::.plotCycleQuality(perCycle$quality)
# Create summary report
report(reads, dest = paste(myFileName, "QAreport", sep = "."), type = "html")
The report will be placed in a folder named after the
fastq.gz
files analyzed. Download the folder to your local
machine and double click the index.html
file inside the
folder to view all the plots in your browser and print to a pdf.
We use bwa
to
align paired reads to the reference genome. The alignments are saved in
uncompressed .sam
files.
bwa command
The basic bwa command is as follows (the backslash continues a single command on the next line).
On Compute Canada:
bwa mem -M stickleback_v5_assembly.fa fishName_R1.fastq.gz \
fishName_R2.fastq.gz > fishName.sam
We can speed up this step. A single bwa
job can be
divided up into multiple threads that are run concurrently on separate
cores. Here, the example command below divides the process among 16
cores, speeding up the whole operation many fold. Most nodes on Compute
Canada machines have 32 cores, so you can go up to 32 threads.
Here’s the basic command if running on Compute Canada in interactive mode. The salloc command requests 16 cores (CPUs) and 1G of memory per core.
# in unix
salloc --time=3:0:0 --ntasks-per-node=16 --mem-per-cpu=1G --account=def-schluter
# Wait for the resources to be made available to you.
# Then type the following (substituting the name of your actual fish)
bwa mem -M -t 16 stickleback_v5_assembly.fa fishName_R1.fastq.gz \
fishName_R2.fastq.gz > fishName.sam
exit # end interactive session
On the ZCU Cluster, no advance request for resources is required. The command here is
/Linux/bin/bwa-0.7.15 mem -M -t 16 stickleback_v5_assembly.fa fishName_R1.fastq.gz \
fishName_R2.fastq.gz > fishName.sam
Here we take each sam file of aligned reads through the bulk of the GATK4 procedures. We use the steps recommended for “germline short variant discovery (snps + Indels)”. See the tool documentation for the version of GATK you are using.
The output of this sequence of steps is a VCF file containing all the joint genotype calls.
SNP calling involves the following steps.
1. SortSam to sort the reads in the sam file.
2. MarkDuplicates to identify duplicate reads (drop this step if doing GBS).
3. AddOrReplaceReadGroups to assign reads to a single new read group.
4. BaseRecalibrator to make a recalibration table for base quality scores (optional).
5. ApplyBQSR to recalibrate the base quality scores (optional).
6. HaplotypeCaller to call SNPs for individual specimens.
7. GenomicsDBImport to construct a database of gVCF files.
8. GenotypeGVCFs to genotype all samples jointly.
Base recalibration (steps 4 and 5) is optional, but we use it in the standard stickleback pipeline. You’ll need to ensure that the following files are in your working directory.
stickleback_21genome_snp_v5.1.bed
stickleback_21genome_snp_v5.1.bed.idx
I also use this shorter list of known SNPs.
knownSnpsv5.1.vcf
knownSnpsv5.1.vcf.idx
These files contain lists of known SNPs in the stickleback genome. The file stickleback_21genome_snp_v5.1.bed lists the highest-confidence SNPs discovered in the study of 21 stickleback genomes sampled from marine and freshwater sites around the northern hemisphere by Jones et al 2012; Nature. The original file was provided by Felicity Jones and lifted over to the v5 genome coordinates. The second list of SNPs was discovered by Jones et al. 2012; Current Biology, obtained from the Supplement of that paper and from NCBI where the data are deposited. They were lifted over to the new v5 coordinates as described in the section above on converting genome coordinates.
Here’s the sequence of commands applied to a single fish whose sam file name is GrowthHormoneTransgenicLCR_A-line.GHT-A1.sam (abbreviated FISHNAME in the commands below). I had to go up to 24G of memory to avoid crashes. The flag -Xmx22g specifies the maximum amount of memory that can be used. Set the number equal to the amount of memory requested for the job minus a little for overhead.
Here’s what the commands would look like in an interactive session on Compute Canada.
# Request resources for interactive session
salloc --time=12:0:0 --ntasks-per-node=1 --mem-per-cpu=24G --account=def-schluter
# You'll have to wait for the resources to become available to continue
gatk --java-options "-Xms256m -Xmx22g -XX:ParallelGCThreads=8" SortSam -I FISHNAME.sam \
-O FISHNAME.sorted.bam \
-SO coordinate --CREATE_INDEX TRUE
gatk --java-options "-Xms256m -Xmx30g -XX:ParallelGCThreads=8" MarkDuplicates \
-I FISHNAME.sorted.bam \
-O FISHNAME.mkdup.bam \
-M FISHNAME.mkdup.metrics \
--REMOVE_DUPLICATES FALSE --ASSUME_SORT_ORDER coordinate
gatk --java-options "-Xms256m -Xmx4g" AddOrReplaceReadGroups \
-ID GenomeBC.FISHNAME -LB FISHNAME.SB -SM FISHNAME -PL ILLUMINA \
-PU GenomeBC -I FISHNAME.mkdup.bam -O FISHNAME.rg.bam \
-SO coordinate --CREATE_INDEX TRUE
# Optional base quality score recalibration:
gatk --java-options "-Xms256m -Xmx4g" BaseRecalibrator \
-R stickleback_v5_assembly.fa -I FISHNAME.rg.bam \
--known-sites knownSnpsv5.1.vcf \
--known-sites stickleback_21genome_snp_v5.1.bed \
-O FISHNAME.recal.table
gatk --java-options "-Xms256m -Xmx4g" ApplyBQSR -R stickleback_v5_assembly.fa \
-I FISHNAME.rg.bam \
--bqsr-recal-file FISHNAME.recal.table \
-O FISHNAME.recal.bam
exit # end interactive session
The bam files created in the last step contain a lot of information about read pair numbers and coverage along chromosomes. Here’s how to use R to extract some of this information.
The bam
files can be huge. The code below examines one
chromosome at a time, using chrXXI as an example. You might need to run
this code in an interactive session if the memory demands are too
great.
library(Rsamtools)
library(bitops)
# select the bamfile and chromosome to use.
bamfile <- "myBamFile.bam"
chr <- "chrXXI"
baifile <- sub("bam$", "bai", bamfile) # name of bam index file
# obtain chromosome length from genome index
faifile <- read.table("stickleback_v5_assembly.fa.fai", row.names = 1)
chrlength <- faifile[chr,2]
# read fields from the bam file
x <- scanBam(bamfile, index = baifile,
  param = ScanBamParam(what = c("qname", "pos", "qwidth", "flag", "strand", "isize"),
  which = GRanges(seqnames = chr, IRanges(1, chrlength)) ))[[1]]
# Drop unaligned reads
ind <- !is.na(x[["pos"]])
x <- lapply(x, function(x){x[ind]})
# Calculate coverage of mapped reads
ranges1 <- IRanges(start = x$pos, width = x$qwidth,
names=make.names(x[["qname"]], unique = TRUE))
cover <- as.vector(coverage(ranges1))
mean(cover) # mean coverage over all bases
median(cover) # median coverage
# Plot the frequency distribution of coverage.
hist(cover[cover <= 100], right = FALSE, breaks = 50,
main="Coverage", col="firebrick")
# Number of mapped reads.
x[["mapped"]] <- bitAnd(x[["flag"]], 0x0004) != 0x0004
x[["mappedmate"]] <- bitAnd(x[["flag"]], 0x0008) != 0x0008
x[["mapped.in.properpair"]] <- bitAnd(x[["flag"]], 0x0002) == 0x0002
sum(x[["mapped"]]) # No. mapped reads
table(x[["strand"]]) # No. mapped reads on "+" and "-" strands
sum(x[["mapped"]] & x[["mappedmate"]]) # No. reads whose mate also mapped
sum(x[["mapped.in.properpair"]]) # No. reads mapped in a proper pair
# Plot of position of mapped reads along the chromosome
# (the 1-based leftmost position of each query on the reference sequence)
hist(x$pos/10^6, right=FALSE, col="firebrick",
breaks = 200, xlab = "Position (million bases)")
# Alignment sequence (query) width.
median(x[["qwidth"]], na.rm = TRUE)
# Plot of "insert" or "template" length, the number of bases
# from the leftmost mapped base to the rightmost mapped base.
# The leftmost segment has a plus sign and the rightmost has a minus sign.
hist(x$isize[abs(x$isize) <= 700], right = FALSE,
col = "firebrick", breaks = 100)
This step uses GATK’s HaplotypeCaller
to call SNPs for
an individual, separately for each chromosome. Commands below use chrXXI
as an example. Each chromosome and individual yields a compressed gVCF
file. This step can require significant resources for large chromosomes
and probably needs to be run in batch mode.
gatk --java-options "-Xmx4g" HaplotypeCaller \
-R stickleback_v5_assembly.fa -ERC GVCF \
-I FISHID.recal.bam \
-O FISHID.chrXXI.g.vcf.gz \
-L chrXXI \
--pcr-indel-model NONE \
--heterozygosity 0.01 --indel-heterozygosity 0.00125 \
--output-mode EMIT_ALL_ACTIVE_SITES \
-A DepthPerAlleleBySample -A FisherStrand -A InbreedingCoeff \
-A MappingQuality -A MappingQualityRankSumTest \
-A QualByDepth -A ReadPosRankSumTest
Next, create the database of gVCF files. The database for chrXXI will
be located in a folder named chrXXIdb
. If this folder
already exists, delete it before running.
The task requires a sample map, a tab-delimited text file with two
columns. The first column contains the sample names (fish id’s) line by
line. The second column contains the gVCF file names. Here are the first
few lines of an example file named chrXXI.map
to analyze
the chromosome chrXXI.
Marine-Pac-SaritaBay_BC.SAR-29 Marine-Pac-SaritaBay_BC.SAR-29.chrXXI.g.vcf.gz
Solitary-CranbyLake.CRL25 Solitary-CranbyLake.CRL25.chrXXI.g.vcf.gz
Marine-Pac-MudLake_AK.ML-39 Marine-Pac-MudLake_AK.ML-39.chrXXI.g.vcf.gz
Solitary-TroutLake.TRL-2 Solitary-TroutLake.TRL-2.chrXXI.g.vcf.gz
Marine-Pac-DoranPark_CA.CA02-29 Marine-Pac-DoranPark_CA.CA02-29.chrXXI.g.vcf.gz
Marine-Pac-MudLake_AK.ML-25 Marine-Pac-MudLake_AK.ML-25.chrXXI.g.vcf.gz
...
The following R commands can help to make the map file if your file names have the format FISHID.chrXXI.g.vcf.gz.
# Grab all files whose names end with "chrXXI.g.vcf.gz"
z <- list.files(pattern = "chrXXI.g.vcf.gz$")
# This works if the rest of the file name is the name of the individual
fishname <- sub(".chrXXI.g.vcf.gz", "", z)
# make the map file
chrXXImap <- data.frame(fishname, z)
write.table(chrXXImap, file = "chrXXI.map", quote = FALSE,
col.names = FALSE, row.names = FALSE, sep = "\t")
Now construct the database with GATK.
# in unix
gatk --java-options "-Xmx4g" GenomicsDBImport \
--genomicsdb-workspace-path chrXXIdb \
--sample-name-map chrXXI.map \
-L chrXXI
exit
Finally, use GenotypeGVCFs to call SNPs and invariant sites jointly on the samples in your database. Here is code for Compute Canada to run it on chrXXI using the database chrXXIdb created in the previous step. In this example, we include invariant as well as variant sites. The output is sent to a VCF file named global.chrXXI.vcf.gz.
NOTE: GATK used to represent uncalled (missing) genotypes as "./." in the VCF output of GenotypeGVCFs. But as of GATK 4.2.3.0, missing genotypes might appear as "0/0" instead (except in some cases with phased variants). GATK wants us to know that "You'll still be able to determine the genotype as missing because the FORMAT DP field will be 0, though." Later, I use a genotype quality metric to distinguish missing from non-missing genotypes.
# in unix
gatk --java-options "-Xmx4g" GenotypeGVCFs \
-R stickleback_v5_assembly.fa \
-V gendb://chrXXIdb \
-L chrXXI \
--include-non-variant-sites \
--max-alternate-alleles 3 \
--standard-min-confidence-threshold-for-calling 20 \
-O global.chrXXI.vcf.gz
This section describes several tools to extract SNPs, indels, and invariant sites from a VCF file and filter them on the basis of quality. It also describes ways to extract a subset of samples from VCF files and drop ALT alleles unused in the subset.
The following command extracts only SNPs (single nucleotide polymorphisms) and saves the results in a new VCF file. SNPs that failed the phred-scaled confidence threshold for calling variants, as determined by the --standard-min-confidence-threshold-for-calling argument in the GenotypeGVCFs command (I used 20; the default is 30), are given a LowQual value in the FILTER column and are removed by the --exclude-filtered argument in the SelectVariants command below.
Note that in GATK, SNPs include single nucleotide substitutions as well as spanning deletions. The ALT allele for a spanning deletion is indicated as * in the VCF file. Most are labeled LowQual and filtered out. Not all downstream programs will accept spanning deletions.
gatk --java-options "-Xmx2g" SelectVariants \
-V global.chrXXI.vcf.gz \
--select-type-to-include SNP \
--exclude-filtered \
-O global.chrXXI.snp.vcf.gz
The following command extracts only indels (insertions and deletions) and saves the results in a new VCF file. Indels that failed the phred-scaled confidence threshold for calling variants are given a LowQual value in the FILTER column and are removed by the --exclude-filtered argument in the command below.
gatk --java-options "-Xmx2g" SelectVariants \
-V global.chrXXI.vcf.gz \
--select-type-to-include INDEL \
--exclude-filtered \
-O global.chrXXI.indel.vcf.gz
The VCF output of this command includes only invariant sites, i.e., neither SNPs nor indels.
Add the argument --exclude-filtered to the SelectVariants command below (as in the commands above to extract SNPs or indels) to leave out sites having LowQual in the FILTER column. These are sites that failed the minimum threshold for calling a SNP and will have '.' or '*' in the ALT column. Here, I did not include this quality filter and instead prefer to filter invariant genotypes according to genotype quality at a later stage in the analysis.
gatk --java-options "-Xmx2g" SelectVariants \
-V global.chrXXI.vcf.gz \
--select-type-to-include NO_VARIATION \
-O global.chrXXI.invar.vcf.gz
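As a quick sanity check (my addition, not part of the pipeline), you can count the number of data lines in each of the three output files:
# Count data lines (non-header lines) in each extracted VCF file
for f in global.chrXXI.snp.vcf.gz global.chrXXI.indel.vcf.gz \
global.chrXXI.invar.vcf.gz; do
echo "$f:"
zcat "$f" | grep -v '^#' | wc -l
done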
Before filtering out low-quality SNPs based on annotations provided in the VCF file, it is a good idea to plot the relevant metrics to get a sense of whether the recommended filtering cutoffs make sense for your data. To do this you’ll need to read some of the contents of the VCF file into R.
A large random sample of SNPs should suffice. If you really want to read the whole VCF file and are running into memory limits (most likely if you are also reading genotype fields such as GT and GQ), read the VCF file in chunks, as explained here.
The following code (obtained from VCF-tricks) samples 10,000 SNPs (modify by editing the number in 'shuf -n 10000'). It assumes that the number of header lines (leading lines beginning with "#" in the VCF file) is less than 2000 (modify by editing the number in 'head -n 2000').
# Number of SNPs in VCF file (number of lines not beginning with '#')
zcat global.chrXXI.snp.vcf.gz | zgrep -v '^#' | wc -l
# Which is the first non-header line? (prints line number and a few characters)
zgrep -v -m1 -n '^#' global.chrXXI.snp.vcf.gz | head -c 20
# Take a random sample of 10,000 SNPs
(zcat global.chrXXI.snp.vcf.gz | head -n 2000 | \
zgrep '^#' ; zgrep -v '^#' global.chrXXI.snp.vcf.gz | \
shuf -n 10000 | LC_ALL=C sort -k1,1V -k2,2n) | \
bgzip > global.chrXXI.snp.10K.vcf.gz
tabix -p vcf global.chrXXI.snp.10K.vcf.gz
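To confirm that the sampled file contains the expected number of SNPs (10,000, unless the source file had fewer):
# Should print 10000
zcat global.chrXXI.snp.10K.vcf.gz | grep -v '^#' | wc -l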
Now, choose the quality metrics of interest. For example, here I select the total read depth (DP) of each SNP, the Quality by Depth (QD) score, and the Fisher's strand bias (FS) score. Remove the 'info =' argument entirely to read all available INFO fields. To save memory, leave out the genotype fields (geno = NA).
# in R:
library(VariantAnnotation)
# Choose which VCF fields to read
param <- ScanVcfParam(fixed = c("ALT"), geno = NA,
info = c("DP", "QD", "FS"))
Read the specified fields into a vcf object in R.
# Read VCF file into R
vcf <- readVcf(file = "global.chrXXI.snp.10K.vcf.gz",
genome = "stickleback_v5_assembly.fa.gz",
param = param)
# View SNPs
rowRanges(vcf)
# View INFO field variables read
info(vcf)
# Histogram of DP
hist(info(vcf)$DP, right = FALSE, breaks = 40, col = "firebrick")
# Histogram of QD
hist(info(vcf)$QD, right = FALSE, breaks = 40, col = "firebrick")
# Histogram of FS
hist(info(vcf)$FS, right = FALSE, breaks = 40, col = "firebrick")
Set geno = c("GQ") when reading the random sample of SNPs to view the distribution of genotype quality scores (GQ).
param <- ScanVcfParam(fixed = NA, geno = c("GQ"), info = NA)
# Read VCF file into R
vcf <- readVcf(file = "global.chrXXI.snp.10K.vcf.gz",
genome = "stickleback_v5_assembly.fa.gz",
param = param)
GQ <- as.vector(geno(vcf)$GQ)
hist(GQ, right = FALSE, breaks = 40, col = "firebrick")
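Before committing to a GQ cutoff, it can help to check what fraction of genotype calls the cutoff would set to missing. A small check of my own, using the GQ vector just created (I use a cutoff of 10 in the filtering step later):
# Fraction of genotype calls that would be recoded as missing
mean(GQ < 10, na.rm = TRUE)
# Compare several candidate cutoffs at once
sapply(c(5, 10, 20), function(cutoff) mean(GQ < cutoff, na.rm = TRUE))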
We largely follow GATK's recommendations on thresholds for hard filtering SNPs and indels according to quality metrics contained in the VCF file, but we confirmed or modified the thresholds based on plots of the data (see previous section).
The following VariantFiltration command hard-filters the SNPs. SNPs failing a filter are not removed yet. Instead, the FILTER field in the output file is annotated as PASS if the SNP passes all filters. If a SNP fails a filter, the FILTER field is annotated with the name of the filter it failed.
In the version of GATK used here, I got many warnings concerning MQRankSum and ReadPosRankSum, but online forums about this say that the warnings should be ignored.
If a SNP is missing a quality annotation (e.g., no MQ value is given for a given SNP), then by default it passes that test (change this by setting the --missing-values-evaluate-as-failing argument to true).
The subsequent SelectVariants command removes the SNPs failing the hard-filter step.
gatk --java-options "-Xmx4g" VariantFiltration \
-V global.chrXXI.snp.vcf.gz \
-filter "QD < 2.0" --filter-name "QD2" \
-filter "FS > 60.0" --filter-name "FS60" \
-filter "SOR > 3.0" --filter-name "SOR3" \
-filter "MQ < 40.0" --filter-name "MQ40" \
-filter "MQRankSum < -4.0" --filter-name "MQRankSum-4" \
-filter "ReadPosRankSum < -4.0" --filter-name "ReadPosRankSum-4" \
-filter "ExcessHet > 54.69" --filter-name "ExcessHet5469" \
-O global.chrXXI.snp.hf.vcf.gz
gatk --java-options "-Xmx4g" SelectVariants \
-V global.chrXXI.snp.hf.vcf.gz \
--exclude-filtered \
-O global.chrXXI.snp.filtered.vcf.gz
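To see how many SNPs failed each filter, you can tally the FILTER column (column 7 in a standard VCF) of the hard-filtered file before the failing SNPs are removed. This check is my addition; SNPs failing more than one filter appear with semicolon-separated filter names.
# Tally values in the FILTER column of the hard-filtered VCF file
zcat global.chrXXI.snp.hf.vcf.gz | grep -v '^#' | cut -f7 | sort | uniq -c | sort -rn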
We used the following code to hard filter indels:
gatk --java-options "-Xmx2g" VariantFiltration \
-V global.chrXXI.indel.vcf.gz \
-filter "QD < 2.0" --filter-name "QD2" \
-filter "QUAL < 20.0" --filter-name "Q20" \
-filter "FS > 60.0" --filter-name "FS60" \
-filter "ReadPosRankSum < -4.0" \--filter-name "ReadPosRankSum-4" \
-filter "ExcessHet > 54.69" --filter-name "ExcessHet5469" \
-O global.chrXXI.indel.hf.vcf.gz
# Now, remove indels that failed the filters
gatk --java-options "-Xmx2g" SelectVariants \
-V global.chrXXI.indel.hf.vcf.gz \
--exclude-filtered \
-O global.chrXXI.indel.filtered.vcf.gz
Recent versions of GATK indicate missing genotypes as "0/0", i.e., the homozygous reference genotype, instead of the missing symbol "./.". We need to recode to distinguish missing genotypes from true reference homozygotes.
Filtering SNPs and indels according to genotype quality (GQ) accomplishes this, converting missing and other poor-quality genotypes to "./.". Here I recode as missing ("./.") any genotype in the SNP VCF file whose genotype quality (GQ) is less than 10.
After hard-filtering genotypes, some ALT alleles at a SNP might no longer occur in any genotype present in the file. The additional SelectVariants command will clear them out. Some SNPs might become invariant; such sites will still be present in the output VCF file, but will be indicated by a "." in the ALT field.
vcftools --gzvcf global.chrXXI.snp.filtered.vcf.gz \
--minGQ 10 --recode --recode-INFO-all \
--out global.chrXXI.snp.filtered
# Compress and index (bgzip replaces the .vcf file with a .vcf.gz file)
bgzip global.chrXXI.snp.filtered.recode.vcf
tabix -p vcf global.chrXXI.snp.filtered.recode.vcf.gz
gatk --java-options "-Xmx4g" SelectVariants \
-V global.chrXXI.snp.filtered.recode.vcf.gz \
--remove-unused-alternates \
-O global.chrXXI.snp.GQfiltered.vcf.gz
Here is the same procedure applied to indels.
vcftools --gzvcf global.chrXXI.indel.filtered.vcf.gz \
--minGQ 10 --recode --recode-INFO-all \
--out global.chrXXI.indel.filtered
bgzip global.chrXXI.indel.filtered.recode.vcf
tabix -p vcf global.chrXXI.indel.filtered.recode.vcf.gz
gatk --java-options "-Xmx4g" SelectVariants \
-V global.chrXXI.indel.filtered.recode.vcf.gz \
--remove-unused-alternates \
-O global.chrXXI.indel.GQfiltered.vcf.gz
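To check how much missing data the GQ filter introduced, vcftools can tabulate the proportion of missing genotypes per individual (this check is my addition, shown here for the SNP file). The results are written to a file with the suffix .imiss.
# Per-individual missingness after GQ filtering
vcftools --gzvcf global.chrXXI.snp.GQfiltered.vcf.gz \
--missing-indv --out global.chrXXI.snp.GQfiltered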
Here is a similar procedure applied to invariant sites. The invariant VCF file has an RGQ field indicating genotype quality rather than a GQ field. We are lenient and use a minimum read depth (DP) of 1 to call genotypes. Genotypes based on DP < 1 are converted to missing.
vcftools --gzvcf global.chrXXI.invar.vcf.gz \
--minDP 1 --recode --recode-INFO-all \
--out global.chrXXI.invar.filtered
mv global.chrXXI.invar.filtered.recode.vcf global.chrXXI.invar.DPfiltered.vcf
bgzip global.chrXXI.invar.DPfiltered.vcf
tabix -p vcf global.chrXXI.invar.DPfiltered.vcf.gz
Many analysts keep only biallelic SNPs (and indels). This step can be accomplished by inserting the extra line --restrict-alleles-to BIALLELIC \ into one of the above SelectVariants commands, perhaps right after --remove-unused-alternates. Alternatively, this can be done later in R, as sketched below.
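Here is a minimal sketch of the R approach (my example, not the only way): after reading the VCF file with readVcf as shown earlier, keep only sites with exactly one ALT allele.
# in R: keep only biallelic sites (exactly one ALT allele)
library(VariantAnnotation)
vcf <- readVcf(file = "global.chrXXI.snp.GQfiltered.vcf.gz",
genome = "stickleback_v5_assembly.fa.gz")
vcfBiallelic <- vcf[elementNROWS(alt(vcf)) == 1, ]
Note that a spanning deletion (*) counts as an ALT allele here, so a site whose only alternate allele is a spanning deletion would also be kept.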
After processing VCF files separately by chromosome, it may be necessary to merge them into a single large VCF file. Use GATK's MergeVcfs instead if the chromosomes aren't listed in the same order as in the headers of each of the individual files. Use SortVcf instead if the individual VCF files are not sorted by coordinates.
GatherVcfs
You can use GatherVcfs in GATK to merge VCF files. The chromosomes must be listed in the same order as in the headers of each of the individual files and cannot have overlapping positions. The samples must be the same in each of the files.
gatk --java-options "-Xmx4G" GatherVcfs \
-I subset.chrI.snp.filtered.vcf.gz \
-I subset.chrII.snp.filtered.vcf.gz \
-I subset.chrIII.snp.filtered.vcf.gz \
-I subset.chrIV.snp.filtered.vcf.gz \
-I subset.chrIX.snp.filtered.vcf.gz \
-I subset.chrV.snp.filtered.vcf.gz \
-I subset.chrVI.snp.filtered.vcf.gz \
-I subset.chrVII.snp.filtered.vcf.gz \
-I subset.chrVIII.snp.filtered.vcf.gz \
-I subset.chrX.snp.filtered.vcf.gz \
-I subset.chrXI.snp.filtered.vcf.gz \
-I subset.chrXII.snp.filtered.vcf.gz \
-I subset.chrXIII.snp.filtered.vcf.gz \
-I subset.chrXIV.snp.filtered.vcf.gz \
-I subset.chrXIX.snp.filtered.vcf.gz \
-I subset.chrXV.snp.filtered.vcf.gz \
-I subset.chrXVI.snp.filtered.vcf.gz \
-I subset.chrXVII.snp.filtered.vcf.gz \
-I subset.chrXVIII.snp.filtered.vcf.gz \
-I subset.chrXX.snp.filtered.vcf.gz \
-I subset.chrXXI.snp.filtered.vcf.gz \
-O subset.merged.snp.filtered.vcf.gz
gatk IndexFeatureFile -I subset.merged.snp.filtered.vcf.gz
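Typing 21 -I arguments is error-prone; a shell loop can build the argument list from the files present (a sketch of my own). The shell glob sorts names alphabetically, which happens to match the chromosome order in the command above; make sure the resulting order matches the contig order in the VCF headers.
# Build the -I argument list from the per-chromosome files
INPUTS=""
for f in subset.chr*.snp.filtered.vcf.gz; do
INPUTS="$INPUTS -I $f"
done
# $INPUTS is deliberately unquoted so it expands to separate arguments
gatk --java-options "-Xmx4G" GatherVcfs $INPUTS \
-O subset.merged.snp.filtered.vcf.gz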
Log in to a computer system using ssh in a terminal window. From off campus you might need to run myVPN and endure multi-factor authentication (not yet needed on Compute Canada).
For example, to log in to Graham at Compute Canada:
ssh -X -Y YOUR_NAME@graham.computecanada.ca
Or to log in to ARC Sockeye:
ssh -Y -X YOUR_NAME@sockeye.arc.ubc.ca # use cwl password
Compute Canada and ARC Sockeye keep track of how efficiently you use memory and CPU resources. Wasting resources reduces your priority rating (future jobs will sit longer in the queue). So it is a good idea to request the minimum amount that you need and, in the case of interactive mode, to exit as soon as you are done so resources don't sit idle. See the section below on monitoring jobs to figure out what resources your successful jobs used.
Compute Canada has a lot of software pre-installed, including GATK, R, and most Bioconductor packages.
ARC Sockeye doesn’t have a lot of software installed, but you can load the Compute Canada software module and then have access to all the same software, as follows:
# Sockeye only
module load CVMFS_CC
The ZCU Cluster doesn’t have an up-to-date software list.
If using Compute Canada, find the most recent installed version of the program listed here and load the corresponding module.
For example, to run a GATK command (the example here creates a sequence dictionary of the stickleback reference genome), load the gatk module and then run the command:
module load gatk/4.2.4.0
gatk --java-options "-Xmx2g" CreateSequenceDictionary \
-R stickleback_v5_assembly.fa
You can run Compute Canada software on ARC Sockeye by loading this module:
# Sockeye only
module load CVMFS_CC
module load gatk/4.2.4.0
gatk --java-options "-Xmx2g" CreateSequenceDictionary \
-R stickleback_v5_assembly.fa
If you are using a Mac, install XQuartz so that R can display plots on the screen of your local computer.
To run R on Compute Canada, load the module first.
module load gcc/9.3.0 r/4.1.0 r-bundle-bioconductor/3.14
R
If using the Compute Canada software module on ARC Sockeye, type
# on ARC Sockeye only
module load CVMFS_CC
module load gcc/9.3.0 r/4.1.0 r-bundle-bioconductor/3.14
R
On the ZCU Cluster, to run R and access Bioconductor packages type
# On the ZCU Cluster only
/Linux/R-4.2.0/bin/R
Type library() on the R command line to check which R packages are already available. Additional packages can be installed in your local folder. For example:
# in R
install.packages("hierfstat")
Most work on Compute Canada and ARC Sockeye will be done in a scratch folder, which has lots of disk space but is not backed up. Inactive files eventually expire.
It can be convenient, the first time you use the system, to create a link to your scratch folder by typing the following into a command window.
ln -s /scratch/YOUR_NAME scratch # make a symbolic link
cd scratch # go to your scratch folder
For tasks on Compute Canada or ARC Sockeye requiring significant time and memory resources, try running your code first in interactive mode. This will help you decide how much resources you’ll need if you run similar jobs in batch mode.
On Compute Canada, start an interactive session with the following unix command, which requests 32G of memory on a single core for 1 hour. You may need to wait a while for the interactive job to begin. If your work is done before the requested time is up, it is important to enter exit to end the interactive session so that resources don't sit idle until your full time runs out.
salloc --time=1:0:0 --ntasks-per-node=1 --mem-per-cpu=32G --account=def-schluter
[type your commands]
exit
Quick reference for a few useful tasks.
Lots of files are generated when processing genome sequence data, which starts to eat up disk space. Remove files created along the way that you no longer need.
To check disk space usage on a Compute Canada computer on all file systems, type the following command on your unix command line.
diskusage_report
Use the following command instead on ARC Sockeye.
print_quota
Especially on Compute Canada machines, your priority (affecting how quickly your queued job is run) declines as you use (or waste) resources. It takes a week or so to rebuild. One strategy when this happens is to switch to a different computer system within Compute Canada. If you have been working on Graham and your jobs are bogged down in the queue, move files to Beluga and try again. Priority is computer-specific.
Here is how to check my group's priority when you are logged in to a Compute Canada computer system:
sshare -l -A def-schluter_cpu -a --format=account,User,EffectvUsage,LevelFS
Compute Canada uses SLURM to schedule batch jobs (so does ARC Sockeye, as of November 2023).
Follow the instructions here on how to create a job script. To submit the job, enter
# in unix
sbatch YOUR_SCRIPT.sh
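Here is a minimal example of a job script (my sketch; adjust the time, memory, account name, module version, and command to your own job). Save it as YOUR_SCRIPT.sh and submit with sbatch as above.
#!/bin/bash
#SBATCH --time=3:00:00            # maximum run time
#SBATCH --ntasks-per-node=1       # single core
#SBATCH --mem-per-cpu=8G          # memory per core
#SBATCH --account=def-schluter    # your allocation account

# Load the modules your commands need
module load gatk/4.2.4.0

# The command(s) to run
gatk --java-options "-Xmx4g" SelectVariants \
-V global.chrXXI.vcf.gz \
--select-type-to-include SNP \
--exclude-filtered \
-O global.chrXXI.snp.vcf.gz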
Other useful commands include:
# Check the status of all jobs submitted
squeue -u YOUR_NAME
# Cancel a running job having job id 123456
scancel 123456
# Check resources used by successfully completed job 123456
seff 123456
An ExitCode of 0 means that no errors were generated. A low CPU Efficiency means that you requested more CPU time than you needed. A low Memory Efficiency means that you requested much more memory than you needed.
To check resources used by all jobs submitted between two dates:
sacct --user=YOUR_NAME --starttime=yyyy-mm-dd --endtime=yyyy-mm-dd \
--format=JobID,JobName,AllocCPUS,ExitCode,MaxRSS,Elapsed,UserCPU
© 2009-2024 Dolph Schluter