A combination of R and other programs can be used to plot, filter, select, and analyze genome resequence data (as well as genotyping-by-sequencing, or GBS, data).
First, I show how to read a VCF file, quality filter it, convert a VCF file to a SNP table, how to carry out principal components analysis on SNP data, and how to compute divergence metrics like allele frequency differences and Fst.
I also describe more specialized tasks such as how to extract a subset of samples and SNPs from a VCF file, how to read a genome fasta file into R, and how to convert (lifting over) genome coordinates from one reference genome to another.
Finally, I describe our GATK4 pipeline to call SNPs and generate a
VCF file – how we get from short read fastq
files to VCF files of SNP
data using the GATK4 best practices. This section is less about R.
Command syntax below is that for running software on Digital Research Alliance (a.k.a. Compute Canada) systems, but other systems will be similar. Information on how to log in and run programs on UBC computers is given at the end. My examples ignore paths to files in other folders. Modify as needed.
Most large computer systems will require you to load program modules before you can run them. The code to do so is computer system-specific. Here are example load commands for most of the examples on this page when running them on Compute Canada (Digital Alliance) computers. Other computer systems will be similar. Program versions are indicated after the / and will need to be updated periodically.
# R and Bioconductor
module load gcc/9.3.0 r/4.1.0 r-bundle-bioconductor/3.14
# bcftools (for manipulating VCF files)
module load bcftools/1.13
# beagle (for imputation)
module load beagle/5.4
# bwa (for aligning reads)
module load bwa/0.7.17
# gatk (for SNP calling and filtering VCF files)
module load gatk/4.2.4.0
# htslib (for bgzip and tabix)
module load htslib/1.15.1
# plink2 (for PCA)
module load plink/2.00-10252019-avx2
# samtools (for processing bam files)
module load samtools/1.10
# vcftools (for filtering VCF files)
module load vcftools/0.1.16
I’ll start this R tips page by assuming that you already have a clean VCF file containing called SNPs only. Jump to the section below for details on our SNP-calling pipeline that yields such a VCF file.
Recent versions of GATK4 code missing genotypes as 0/0, i.e., the homozygous reference genotype, instead of the missing symbol, ./. This is not a good idea. We include a step in our pipeline to recode low-quality genotype calls back to ./.
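For example, here is a minimal R sketch of that recoding idea, assuming the VCF was read into an object vcf with the GT and DP genotype fields (the rule used here, DP equal to 0, follows GATK's own advice quoted later on this page; our actual pipeline step may use additional criteria such as genotype quality):
# in R
GT <- geno(vcf)$GT
DP <- geno(vcf)$DP
# Genotypes with zero read depth are really missing; recode them as ./.
GT[is.na(DP) | DP == 0] <- "./."
geno(vcf)$GT <- GT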
VariantAnnotation
This is the most important R package for processing VCF files and SNP
data. It is part of the Bioconductor project and is more versatile than
other packages I’ve tried. Instructions on how to install Bioconductor
packages are here.
For example, install VariantAnnotation
as follows.
Make sure you have the latest version of R installed first. Then check the Bioconductor page for the latest version compatible with your R version. Next, install the installer if you don’t already have it. Then install the package (here, version 3.17).
install.packages("BiocManager")
BiocManager::install(version = "3.17", update = FALSE, ask = FALSE)
Finally, install VariantAnnotation. Other Bioconductor packages can be installed the same way.
BiocManager::install("VariantAnnotation", update = FALSE, ask = FALSE)
This is typically how you’ll read a SNP data file into R.
Start by reading a small VCF file or you will quickly run out of
memory. You’ll need to specify which fields to read. The more fields you
read, the more memory you will require. I assume that your file is
gz-compressed, but R readVcf()
will also work on
uncompressed VCF files.
R will create names for the SNPs.
The R vcf object is named vcf
in this example.
library(VariantAnnotation)
# VCF file name
vcfFileName <- "small.vcf.gz"
# Choose which VCF fields to read (here, the ALT alleles and the genotypes)
param <- ScanVcfParam(fixed = c("ALT"), geno = "GT", info = NA)
# Read VCF file into R
vcf <- readVcf(file = vcfFileName, param = param)
# View a few snps
rowRanges(vcf)
# Peek at the first 5 columns and rows of the SNP genotypes
geno(vcf)$GT[1:5,1:5]
# Extract the sample names
colnames(vcf)
# View the SNP names R has constructed
head(names(vcf))
A VCF file might be huge, but you can select a range of SNPs to read into memory.
In stickleback, an inversion polymorphism on chromosome XXI runs from about position 9914802 to position 11605355 in v5 of the reference genome. To grab the SNPs in this interval from a VCF file, try the following.
library(VariantAnnotation)
vcfFileName <- "global.chrXXI.snp.vcf.gz"
myInterval <- makeGRangesFromDataFrame(data.frame(
seqnames = "chrXXI", start = 9914802, end = 11605355))
myInterval
param <- ScanVcfParam(fixed = "ALT", geno = c("GT"), info = NA,
which = myInterval)
vcf <- readVcf(file = vcfFileName, param = param)
# Confirm that the SNPs in the file are within myInterval.
rowRanges(vcf)
Reading a big VCF file and many of its fields can be done in chunks to save memory. Then combine the chunks to create a single vcf object retained in memory.
library(VariantAnnotation)
# 1. Choose the fields to read
param <- ScanVcfParam(fixed = c("ALT"), geno = c("GT"),
info = c("DP"))
# 2. Decide on a chunk size (number of lines read at a time)
chunksize <- 500000
# 3. Open VCF file for reading in chunks
myVcfFile <- VcfFile("global.chrXXI.snp.vcf.gz",
yieldSize = chunksize)
open(myVcfFile)
# 4. Read VCF file in chunks and keep them in the list object vcfchunks
vcfchunks <- list()
k <- 1
while (nrow( vcfchunks[[k]] <- readVcf(myVcfFile, param = param)) ){
# Print progress
cat("new lines read:", dim(vcfchunks[[k]]), "\n")
k <- k + 1
}
# 5. Combine the chunks into a single vcf object
vcf <- vcfchunks[[1]]
for(j in 2:length(vcfchunks)){
vcf <- rbind(vcf, vcfchunks[[j]])
}
# 6. Close the input file
close(myVcfFile)
rm(vcfchunks)
R can be used to carry out additional filtering steps on a VCF file containing only SNPs. At the end of this section I also show how to filter a VCF file containing only indels.
Read the VCF file into R. Here, I read only the ALT alleles and leave out the genotypes, which are not needed yet. Leaving them out will speed up the tasks.
library(VariantAnnotation)
vcfFileName <- ""global.chrXXI.snp.vcf.gz""
param <- ScanVcfParam(fixed = c("ALT"), geno = NA, info = NA)
vcf <- readVcf(file = vcfFileName, param = param)
If you haven’t already dropped SNPs having more than one ALT allele, you can do it in R with the following commands. Invariant sites are retained.
# Count the number of entries in the ALT field at each snp
# (invariants will have a single entry, though blank)
altAlleles <- CharacterList(alt(vcf))
nAltAlleles <- sapply(altAlleles, length)
# Frequency of ALT allele numbers
table(nAltAlleles)
# Keep only SNPs with a single entry in the ALT column
vcf <- vcf[nAltAlleles == 1]
# Number of SNPs remaining
nrow(vcf)
Most reference genomes have an accompanying file that identifies interspersed repeats and low-complexity DNA sequences, created using RepeatMasker or a similar program. Below I show how to drop these sites from your VCF file.
First, download the RepeatMasker file for your genome. For stickleback reference genome v5, the file is stickleback_v5_repeatMasker.gff.gz, available at the genome site.
Then read and save these masked repeat regions to a local file as follows.
# Read file
x <- read.table(gzfile("stickleback_v5_repeatMasker.gff.gz"),
sep = "\t", fill = TRUE, stringsAsFactors = FALSE)
# Keep the columns indicating start and end positions of repeats
x1 <- x[, c(1,4,5)]
names(x1) <- c("chr","start","end")
# Split into a list object by chromosome
repeatMaskedBases <- split(x1, x1$chr)
length(repeatMaskedBases)
# [1] 24
# Check names
names(repeatMaskedBases)
# Save
save(repeatMaskedBases, file = "stickleback_v5_RepeatMaskedBases.rdd")
Then drop SNPs at masked bases from your vcf object (here named vcf).
# Load saved R object
load("stickleback_v5_RepeatMaskedBases.rdd")
# Pull out repeat regions in the chromosome of interest (here, chrXXI)
repeats <- repeatMaskedBases[["chrXXI"]]
if(length(repeats) > 0){
# Make an object containing the start and end positions of repeats
repeatMaskedRanges <- IRanges(start = repeats$start, end = repeats$end)
# Make a similar object with start and end positions of SNPs
snpRanges <- IRanges(start = start(vcf), end = end(vcf))
# Find which SNPs overlap the repeat regions
z1 <- findOverlaps(query = repeatMaskedRanges, subject = snpRanges)
z2 <- unique(subjectHits(z1))
# Remove them
if(length(z2) > 0) vcf <- vcf[-z2]
}
# Number of SNPs remaining
nrow(vcf)
The following commands drop SNPs very close to indels (within 3 bases). This requires reading indel positions from an indel-only file created using GATK. My example here uses a hard-filtered indel file named “global.chrXXI.indel.vcf.gz”.
# Read the VCF file of indels - only their positions are needed
indelVcfFileName <- "global.chrXXI.indel.vcf.gz"
indelvcf <- readVcf(file = indelVcfFileName, param =
ScanVcfParam(fixed = NA, geno = NA, info = NA))
# Make an object containing the start and end positions of indels
indelRanges <- IRanges(start = start(indelvcf), end = end(indelvcf))
# Make an object with ranges spanning 3 bases to either side of SNPs in the vcf object.
snpRanges <- IRanges(start = start(vcf) - 3, end = end(vcf) + 3)
# Find the overlaps
z1 <- findOverlaps(query = indelRanges, subject = snpRanges)
z2 <- unique(subjectHits(z1))
# Delete SNPs close to indels
if(length(z2) > 0) vcf <- vcf[-z2]
# Number of SNPs remaining
nrow(vcf)
Your VCF file might still contain some unwanted SNPs that could cause issues in downstream analyses. For example, it might contain null ALT alleles. To remove them, add the following steps in your R code.
ALT <- CharacterList(alt(vcf))
has.null.allele <- sapply(ALT, function(x){"" %in% x})
vcf <- vcf[!has.null.allele]
Similarly, your VCF file might still contain SNPs with spanning deletions (indicated by "*"). To remove them, add the following steps in your R code.
ALT <- CharacterList(alt(vcf))
has.span.deletion <- sapply(ALT, function(x){"*" %in% x})
vcf <- vcf[!has.span.deletion]
First, make a file that tabulates all the remaining SNPs in your vcf object – those that survived the above filtering steps.
chrom <- as.character(seqnames(vcf))
pos <- start(vcf)
write.table(paste(chrom, pos, sep = "\t"), file = "global.chrXXI.keep.txt",
row.names = FALSE, col.names = FALSE, quote = FALSE, sep = "\n")
Next, make a new VCF file that keeps only the SNPs that passed the R filters. This is most easily done with vcftools in unix. The commands below use the syntax for running vcftools (and htslib, to compress and index the new VCF file) on Digital Alliance (Compute Canada) computers, but the syntax on your computer system will be similar.
# in unix
vcftools --gzvcf global.chrXXI.snp.vcf.gz \
--positions global.chrXXI.keep.txt --recode --recode-INFO-all \
--out global.chrXXI.keep
bgzip global.chrXXI.keep.recode.vcf
mv global.chrXXI.keep.recode.vcf.gz global.chrXXI.snp.Rfiltered.vcf.gz
tabix -p vcf global.chrXXI.snp.Rfiltered.vcf.gz
A VCF file containing only indels can also be filtered. Here is how to keep only biallelic indels and also drop positions corresponding to masked bases.
library(VariantAnnotation)
# Identify the input file and choose which VCF fields to read
vcfFileName <- "global.chrXXI.indel.vcf.gz"
param <- ScanVcfParam(fixed = c("ALT"), geno = NA, info = NA)
# Read VCF file into R
vcf <- readVcf(file = vcfFileName,
param = param)
# Show the number of indels
nrow(vcf)
# View indels
rowRanges(vcf)
# To locate biallelic indels, count number of entries in the ALT field
# (invariants will have a single entry, though blank)
altAlleles <- CharacterList(alt(vcf))
nAltAlleles <- sapply(altAlleles, length)
# Frequency of ALT allele numbers
table(nAltAlleles)
# Keep only biallelic indels
vcf <- vcf[nAltAlleles == 1]
# Number of indels remaining
nrow(vcf)
# To locate indels at masked sites, load R object repeatMaskedBases
load("stickleback_v5_RepeatMaskedBases.rdd")
# Pull out repeat regions in the chromosome of interest (here, chrXXI)
repeats <- repeatMaskedBases[["chrXXI"]]
if(length(repeats) > 0){
# Make an object containing the start and end positions of repeats
repeatMaskedRanges <- IRanges(start = repeats$start, end = repeats$end)
# Make a similar object with start and end positions of indels
indelRanges <- IRanges(start = start(vcf), end = end(vcf))
# Find which indels overlap the repeat regions
z1 <- findOverlaps(query = repeatMaskedRanges, subject = indelRanges)
z2 <- unique(subjectHits(z1))
# Remove them
if(length(z2) > 0) vcf <- vcf[-z2]
}
# Number of indels remaining
nrow(vcf)
# Tabulate remaining indel positions to a text file
chrom <- as.character(seqnames(vcf))
pos <- start(vcf)
write.table(paste(chrom, pos, sep = "\t"), file = "global.chrXXI.keep.txt",
row.names = FALSE, col.names = FALSE, quote = FALSE, sep = "\n")
Finally, extract and write the list of saved indels to a new VCF file
using vcftools
in unix.
vcftools --gzvcf global.chrXXI.indel.vcf.gz \
--positions global.chrXXI.keep.txt --recode --recode-INFO-all \
--out global.chrXXI.keep
bgzip global.chrXXI.keep.recode.vcf
mv global.chrXXI.keep.recode.vcf.gz global.chrXXI.indel.Rfiltered.vcf.gz
tabix -p vcf global.chrXXI.indel.Rfiltered.vcf.gz
To identify (and then exclude) samples with a high proportion of missing genotypes, use the following commands. In the example below, the results are written to a text file named global.chrXXI.snp.Rfiltered.imiss. The column of interest is F_MISS, the fraction of missing genotypes for each sample. This example computes the fraction for one chromosome only.
vcftools --gzvcf global.chrXXI.snp.Rfiltered.vcf.gz \
--missing-indv --out global.chrXXI.snp.Rfiltered
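Back in R, you can then read the .imiss file and list the samples to exclude. A minimal sketch (the 0.5 cutoff is illustrative, not a recommendation):
# in R
imiss <- read.table("global.chrXXI.snp.Rfiltered.imiss", header = TRUE)
# Samples with more than half of their genotypes missing
dropSamples <- as.character(imiss$INDV[imiss$F_MISS > 0.5])
dropSamples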
The following command calculates the inbreeding coefficient \(F\) for each sample. The results are written to a text file named global.chrXXI.snp.Rfiltered.het. This example computes \(F\) for one chromosome only.
vcftools --gzvcf global.chrXXI.snp.Rfiltered.vcf.gz \
--het --out global.chrXXI.snp.Rfiltered
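The .het file can then be read into R to inspect the distribution of \(F\) across samples. A minimal sketch:
# in R
het <- read.table("global.chrXXI.snp.Rfiltered.het", header = TRUE)
# Column F holds the per-sample inbreeding coefficient
hist(het$F, breaks = 30, col = "firebrick", main = "", xlab = "F")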
To detect cases of high kinship between samples, first use GATK to merge SNP data for individual chromosomes. The following example for stickleback leaves out chrM and the sex chromosomes chrXIX and chrY.
I use PLINK2 for this purpose, but it has a surprise: any chromosome named “chrX” will be treated as the X chromosome, even if it is actually chromosome 10. PLINK2 has no option to turn this off. The only available solution is to rename chrX. I show how to rename a chromosome below.
gatk --java-options "-Xmx4G" GatherVcfs \
-I global1330.chrI.snp.Rfiltered.vcf.gz \
-I global1330.chrII.snp.Rfiltered.vcf.gz \
-I global1330.chrIII.snp.Rfiltered.vcf.gz \
-I global1330.chrIV.snp.Rfiltered.vcf.gz \
-I global1330.chrIX.snp.Rfiltered.vcf.gz \
-I global1330.chrUn.snp.Rfiltered.vcf.gz \
-I global1330.chrV.snp.Rfiltered.vcf.gz \
-I global1330.chrVI.snp.Rfiltered.vcf.gz \
-I global1330.chrVII.snp.Rfiltered.vcf.gz \
-I global1330.chrVIII.snp.Rfiltered.vcf.gz \
-I global1330.chrX.snp.Rfiltered.vcf.gz \
-I global1330.chrXI.snp.Rfiltered.vcf.gz \
-I global1330.chrXII.snp.Rfiltered.vcf.gz \
-I global1330.chrXIII.snp.Rfiltered.vcf.gz \
-I global1330.chrXIV.snp.Rfiltered.vcf.gz \
-I global1330.chrXV.snp.Rfiltered.vcf.gz \
-I global1330.chrXVI.snp.Rfiltered.vcf.gz \
-I global1330.chrXVII.snp.Rfiltered.vcf.gz \
-I global1330.chrXVIII.snp.Rfiltered.vcf.gz \
-I global1330.chrXX.snp.Rfiltered.vcf.gz \
-I global1330.chrXXI.snp.Rfiltered.vcf.gz \
-O global1330.merged.snp.Rfiltered.vcf.gz
gatk IndexFeatureFile -I global1330.merged.snp.Rfiltered.vcf.gz
Then use PLINK2 to calculate KING kinship coefficients. Results in the following example are written to the text file global1330.merged.snp.Rfiltered.kin0.
plink2 --vcf global1330.merged.snp.Rfiltered.vcf.gz --set-all-var-ids @:# \
--max-alleles 2 --maf 0.05 --allow-extra-chr --make-bed \
--out global1330.merged.snp.Rfiltered
plink2 --bfile global1330.merged.snp.Rfiltered --allow-extra-chr \
--make-king-table --out global1330.merged.snp.Rfiltered
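Back in R, you can scan the .kin0 table for pairs with high kinship. A minimal sketch (the 0.25 cutoff is illustrative; note that PLINK2 writes the header line with a leading “#”, so the first column name may import slightly mangled):
# in R
kin <- read.table("global1330.merged.snp.Rfiltered.kin0", header = TRUE,
    sep = "\t", comment.char = "")
# Pairs with kinship near 0.5 are likely duplicates;
# about 0.25 is expected for parent-child or full sibs
kin[kin$KINSHIP > 0.25, ]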
To extract data for a subset of individuals, you can use GATK’s
SelectVariants
and an expression to identify the chosen
sample names. vcftools
also has commands to accomplish this.
The following example code selects all individuals whose sample names include the label “Benthic”, but excluding Benthics also having the label “Emily”, and additionally leaving out two specific Benthic individuals.
Some SNPs might now be invariant within the subset. Including the
optional --remove-unused-alternates
will remove ALT alleles
in such cases and make them easier to detect. Invariant sites will be
indicated by a “.” in the ALT field of the output VCF file.
gatk --java-options "-Xmx2g" SelectVariants \
-V global.chrXXI.snp.filtered.vcf.gz \
-se "Benthic" \
-xl-se "Emily" \
-xl-sn "PaxBenthic2010.PAX-B-113" \
-xl-sn "PaxBenthic2010.PAX-B-114" \
--remove-unused-alternates \
-O subset.chrXXI.snp.filtered.vcf.gz
vcftools
has a variety of tools to subset SNPs. For example, the commands below
keep only those SNPs genotyped in at least 75% of individuals.
vcftools --gzvcf global.chrXXI.snp.vcf.gz \
--max-missing 0.75 --recode --recode-INFO-all \
--out global.chrXXI.snp75
bgzip global.chrXXI.snp75.recode.vcf
mv global.chrXXI.snp75.recode.vcf.gz global.chrXXI.snp75.vcf.gz
tabix -p vcf global.chrXXI.snp75.vcf.gz
The following commands read a VCF file and create a SNP table. This works only on single-nucleotide, biallelic SNPs (otherwise the genotype is set to missing).
Genotypes are coded 0 = missing, 1 = 0/0, 2 = 0/1 or 1/0, and 3 = 1/1. Phased genotypes (e.g., “1|0”) are coded similarly. The rows of the table will have the samples, and the columns will have the SNPs.
A “0” is assigned either if the genotype is missing or if the SNP is an invariant and no ALT allele is present in the sample (ALT allele indicated by “.”).
library(VariantAnnotation)
# Read the VCF file
vcfFileName <- "myVcfFile.snp.gz"
param <- ScanVcfParam(fixed = c("ALT"), geno = "GT", info = NA)
vcf <- readVcf(file = vcfFileName, param = param)
# Convert to a SNP table
snps <- genotypeToSnpMatrix(vcf) # rows = samples, columns = snps
z <- apply(snps$genotypes@.Data, 2, function(x) as.integer(x))
rownames(z) <- colnames(geno(vcf)$GT)
snps <- as.data.frame(z)
This method to write (and read) .csv
files is familiar
but slow.
# Write to a csv file
csvFileName <- sub(".vcf.gz$", ".snptable.csv", vcfFileName)
write.csv(snps, file = csvFileName, row.names = TRUE)
# Read the SNP table from the saved .csv file
# Use check.names = FALSE to prevent R from revising SNP and sample names
snps <- read.csv(file = csvFileName, row.names = 1, check.names = FALSE)
Use fwrite()
and fread()
in the
data.table
package for a much faster write/read.
library(data.table)
# Write to a csv file
csvFileName <- sub(".vcf.gz$", ".snptable.csv", vcfFileName)
fwrite(snps, file = csvFileName, sep = ",",
col.names = TRUE, row.names = TRUE)
# Read the SNP table from the saved .csv file
snps <- as.data.frame(fread(file = csvFileName))
rownames(snps) <- snps[, 1]
snps <- snps[, -1]
GATK can also make a table of SNP genotypes from a VCF file, which we
can convert to a numeric code by counting the number of ALT alleles for
each genotype (0, 1 or 2; NA = missing). Note that this number coding is
different from the one resulting from the above method using
VariantAnnotation
in R.
If a SNP is an invariant in the sample (ALT allele indicated by “.”) then all the non-missing genotypes at that position will be coded as “0” rather than as NA.
Here, the VariantsToTable output table will have the columns CHROM, POS, REF, ALT, and GT. These indicate the chromosome name, SNP position, the nucleotide for the REF allele, the nucleotide for the ALT allele, and then many columns of genotypes, one for each sample.
# in unix
gatk VariantsToTable \
-V myVcfFile.snp.vcf.gz \
-F CHROM -F POS -F REF -F ALT -GF GT \
-O myTable.snp.txt
Read the output into R using fread()
from the
data.table
package (much faster than
read.csv()
).
library(data.table)
library(stringr)
# Using fread() from data.table because it is faster than read.csv()
x <- fread(file = "myTable.snp.txt", sep = "\t")
# Convert missing genotypes to NA
x[x == "./."] <- NA
# Create SNP names similar to those made by VariantAnnotation
snp_name <- paste0(x$CHROM, ":", x$POS, "_", x$REF, "/", x$ALT)
# Recover the sample names
fishid <- sub(".GT$", "", names(x)[5:ncol(x)])
# Code the genotypes as numeric by counting the number of ALT alleles per genotype
snps <- apply(x, 1, function(x){
str_count(x[5:length(x)], coll(x["ALT"]))
})
snps <- as.data.frame(snps)
rownames(snps) <- fishid
names(snps) <- snp_name
# Peek at the results
snps[1:10,1:4]
How to read a fasta genome file into R.
This code is much faster than using seqinr, but it converts any lower-case letters to upper case, which is not always desirable.
library(Biostrings)
genome <- readDNAStringSet("stickleback_v5_assembly.fa.gz", "fasta")
# Chromosome names
names(genome)
# Peek at the contents
genome
# Extract the sequence from chrI positions 1000 to 2500
myseq <- subseq(genome["chrI"], start = 1000, end = 2500)
myseq
# Show only the sequence information
as.character(myseq)
# This genome file has no ambiguous letter codes
head(alphabetFrequency(genome))
BED files record genomic positions of SNPs using 0-based, half-open intervals. This means that the start coordinate of a range is 0-based (the first base is 0) and the end coordinate is not included in the interval.
The Bioconductor package rtracklayer
can read ordinary
BED files having seqnames, start and end positions for genomic ranges,
and other optional columns.
import()
will import the BED file into a Bioconductor
GRanges object and will automatically convert positions to 1-based
closed intervals. The start and end coordinates are both included in the
interval, and the first base is 1. If your BED file has a range chr1
1000 2000, it will be imported into a GRanges object as chr1 1001-2000.
as.data.frame()
will convert the GRanges object to a data
frame.
library(rtracklayer)
x <- import("genoData.bed", format = "BED")
geno <- as.data.frame(x)
PLINK BED files come in a trio of files: .bed
,
.bim
, and .fam
created using the
--make-bed
argument in a PLINK command. To read a PLINK BED
file in R, use the read.plink()
function from the
snpStats
package.
For example, to read a file “global.chrXXI.PLINK.bed” that was output
from PLINK2, use the following code. The result will be a data frame
with the following coding:
0 = Missing data
1 = Homozygous for the reference allele (“0/0”)
2 = Heterozygous (“0/1” or “1/0”)
3 = Homozygous for the alternate allele (“1/1”)
library(snpStats)
bedfile <- "global.chrXXI.PLINK.bed"
bimfile <- "global.chrXXI.PLINK.bim"
famfile <- "global.chrXXI.PLINK.fam"
x <- read.plink(bedfile, bim = bimfile, fam = famfile)
geno <- as.data.frame(x$genotypes@.Data)
Alternatively, use PLINK itself to export a BED file to a VCF file
that can be read into R. The VCF file will need to be indexed with
tabix
.
# in unix
plink2 --bfile global.chrXXI.PLINK --export vcf bgz --out global.chrXXI.PLINK
tabix -p vcf global.chrXXI.PLINK.vcf.gz
SNP nucleotide positions in one version of a reference genome can be converted to corresponding positions for the same SNP in a new reference genome if a liftover file has been created.
For example, SNPs discovered using v1 of the stickleback reference
genome (Jones et al 2012;
Nature) can be converted to their corresponding positions in version
5 of the stickleback reference genome using the liftover file,
v1_withChrUn_to_v5.chain.txt
, available at Mike White’s genome browser
site.
Note: this file does not include liftovers for chrM, which are unchanged. It also does not include chrY, which was not part of the v1 genome.
To demonstrate, I convert the SNPs in the array developed by Jones et al. 2012; Current Biology from v1 coordinates to v5 coordinates.
First, read the v1 to v5 liftover file.
library(rtracklayer)
library(GenomicRanges)
# Read liftover file
v1.to.v5_chain <- import.chain("v1_withChrUn_to_v5.chain.txt")
# Check chromosome names
names(v1.to.v5_chain)
Next, read the SNP positions in genome v1. I used the
.csv
file kingsley_snps.csv
,
which contains information on 1491 stickleback SNPs used in the array by
Jones et al. and obtained from the Supplement of that paper and from
NCBI where the data are deposited.
The commands below put the SNP positions into a genomic ranges object (containing start and end positions of genomic intervals) with the SNP positions according to the v1 reference genome.
# Read the file
kingsley_snps <- read.csv("kingsley_snps.csv")
head(kingsley_snps)
# Number of snps
nrow(kingsley_snps)
# Extract the relevant columns - strand is optional and can be left out
v1snps <- data.frame(seqnames = kingsley_snps$chr,
start = kingsley_snps$POS, end = kingsley_snps$POS,
strand = kingsley_snps$strand)
# Make the genomic ranges object with start and end positions of each snp
v1Coordinates <- makeGRangesFromDataFrame(v1snps)
v1Coordinates
Finally, convert the SNP positions to genome v5 coordinates.
The next command carries out the conversion. Each row of v1Coordinates corresponds to a list element in v5Coordinates. In this example, 19 of 1491 SNPs could not be converted to the new coordinates and 17 SNPs on chrM were ignored.
# Convert from v1 to v5 positions
v5Coordinates <- liftOver(v1Coordinates, v1.to.v5_chain)
To coerce the result to a data frame, use the following (missing values will be dropped). The variable group will correspond to the rows of the original v1 data frame.
v5snps <- as.data.frame(v5Coordinates)
nrow(v5snps)
head(v5snps)
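If you want the v5 positions attached to the original SNP table, use group to index the rows of kingsley_snps (the rows of v1Coordinates are in the same order as the rows of kingsley_snps, as constructed above). A sketch:
# Attach v5 chromosome and position to the original rows
# (SNPs that failed to convert are dropped)
kingsley_v5 <- data.frame(kingsley_snps[v5snps$group, ],
    chr.v5 = as.character(v5snps$seqnames), pos.v5 = v5snps$start)
head(kingsley_v5)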
Do this, for example, before using PLINK2 to avoid issues with chromosome names (PLINK2 will assume that anything named “chrX” is the X chromosome, even if it is just the name you’ve given to the 10th autosome).
The first method uses R. The following code renames the chromosome “chrX” to “chr10” in a VCF object. The code assumes you’ve already read the VCF file into memory in an object named vcf using the readVcf() command from the VariantAnnotation package.
# Grab the current chromosome names to a vector
chrNames <- seqlevels(vcf)
# Change the name of chrX in the vector
chrNames[chrNames == "chrX"] <- "chr10"
# Use the changed vector to rename the chromosome in the VCF object
vcf <- renameSeqlevels(vcf, chrNames)
# Check
seqlevels(vcf)
The second method uses bcftools in unix. First, create a text file newChrNames.txt containing “oldChrName newChrName” pairs: the old and new name for each chromosome separated by white space, with each chromosome on a separate line in the file.
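For example, to rename chrX to chr10, newChrNames.txt would contain the single line:
chrX chr10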
bcftools annotate --rename-chrs newChrNames.txt global1330.merged.snp.vcf.gz \
-Oz -o global1330.renamedChr.snp.vcf.gz
tabix -p vcf global1330.renamedChr.snp.vcf.gz
The following code writes a VCF object named vcf
in R to
a VCF file. The new file will be compressed and indexed.
# Write to a new VCF file
writeVcf(vcf, file = "newfile.vcf", index = TRUE)
# If you want to rename the output to .gz and .tbi:
system(paste("mv", "newfile.vcf.bgz", "newfile.vcf.gz"))
system(paste("mv", "newfile.vcf.bgz.tbi", "newfile.vcf.gz.tbi"))
There are many ways to do PCA in R. However, one needs a method to account for missing genotypes.
One solution to the problem of missing genotypes is to impute missing genotypes, but this isn’t necessary unless there are very many missing values.
PCA in PLINK2 works for autosomes only but is really fast. PCA using the earlier version PLINK v1.9 is described at Ravinet and Meier’s helpful speciation genomics web site.
The method has a surprise: any chromosome named “chrX” will be treated as the X chromosome, even if it is actually chromosome 10. PLINK2 has no option to turn this off. The only available solution is to rename chrX. I show how to rename a chromosome here.
PLINK2 obtains principal components from a variance-standardized relationship matrix (which is like a correlation matrix). It handles missing genotypes by calculating the variance-standardized relationship for each pair of individuals using just the non-missing SNPs they share. This is fine if genotypes are missing at random, but it is difficult to know this.
The calculation is sensitive to minor allele frequency, so a maf (minor allele frequency) cutoff is typically specified.
Here are the commands if running Compute Canada software. The
argument --max-alleles 2
is redundant if you’ve already
filtered out multi-allelic sites.
# Make BED file and other files needed for pca
plink2 --vcf global.chrXXI.snp.Rfiltered.vcf.gz --set-all-var-ids @:# \
--max-alleles 2 --maf 0.01 --allow-extra-chr --make-bed \
--out global.chrXXI.snp.Rfiltered
# Carry out PCA
plink2 --bfile global.chrXXI.snp.Rfiltered --pca biallelic-var-wts --allow-extra-chr \
--out global.chrXXI.snp.Rfiltered
PLINK was designed for analyses of the human genome. The argument
--set-all-var-ids
allows nonstandard names for snps;
--allow-extra-chr
permits non-standard chromosome names;
king-cutoff
sets a cutoff for relatedness measured by
kinship (duplicate samples have kinship 0.5, whereas 0.25 is the
expected value for parent-child or full-sib); --maf
sets a
cutoff for minor allele frequency.
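The PCA commands above don’t use king-cutoff, but here is a sketch of how it could be added to drop one member of each pair of close relatives before the PCA (the 0.177 cutoff is illustrative):
plink2 --bfile global.chrXXI.snp.Rfiltered --allow-extra-chr \
  --king-cutoff 0.177 --make-bed --out global.chrXXI.snp.Rfiltered.unrelated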
Plot the PCA scores in R.
# in R
pca <- read.table("global.chrXXI.snp.Rfiltered.eigenvec",
sep = "\t", header = TRUE, comment.char = "")
plot(PC2 ~ PC1, data = pca)
# Eigenvalues
eigenval <- scan("global.chrXXI.snp.Rfiltered.eigenval")
eigenval
# SNP loadings
pca.loadings <- read.table("global.chrXXI.snp.Rfiltered.eigenvec.var"),
sep = "\t", header = TRUE, comment.char = "")
PLINK2 will prune markers in high linkage disequilibrium with one another. Whether this is desirable depends on your project.
Heed my above warning about PLINK2 treating any chromosome named “chrX” as the X chromosome, even if it is actually chromosome 10. To rename a chromosome see here.
The argument --indep-pairwise performs the linkage pruning. The first number, 50, sets the window size, in numbers of variants (append ‘kb’ to give the size in kilobases instead). The second number, 10, is the window step size. The third number sets the \(r^2\) threshold for the amount of linkage considered acceptable. Here we prune SNPs that have an \(r^2\) greater than 0.3.
# Make BED and other files
plink2 --vcf global.chrXXI.snp.Rfiltered.vcf.gz --set-all-var-ids @:# \
--max-alleles 2 --maf 0.01 --indep-pairwise 50 10 0.3 \
--allow-extra-chr --make-bed --out global.chrXXI.snp.Rfiltered
# PCA
plink2 --bfile global.chrXXI.snp.Rfiltered \
--extract global.chrXXI.snp.Rfiltered.prune.in --pca --allow-extra-chr \
--out global.chrXXI.snp.Rfiltered
Plot the PCA scores in R.
# in R
pca <- read.table("global.chrXXI.snp.Rfiltered.eigenvec",
sep = "\t", header = TRUE, comment.char = "")
plot(PC2 ~ PC1, data = pca)
pcadapt
pcadapt
is an R package that uses PCA to detect local adaptation. You may have
to install it first.
In the first step, use PLINK2 to convert genotype data in a VCF file to the BED file format (here shown without using pruning). Heed my earlier warning that PLINK2 will treat any chromosome named “chrX” as the X chromosome, even if it is actually chromosome 10. To rename a chromosome see here.
# Make BED and other files
plink2 --vcf global.chrXXI.snp.Rfiltered.vcf.gz --set-all-var-ids @:# \
--max-alleles 2 --maf 0.01 --allow-extra-chr --make-bed \
--out global.chrXXI.snp.Rfiltered
Then run pcadapt
in R.
library(pcadapt)
library(VariantAnnotation)
# Grab individual sample names from the PLINK .fam file
x <- read.table("global.chrXXI.snp.Rfiltered.fam")
ID <- x[, 2]
# Optional commands to make grouping variable from ID.
# Here I extract info from names to indicate ecotype (i.e., benthic or limnetic)
# Your command will be different if you are using other labels or population types
ecotype <- rep("limnetic", length(ID))
ecotype[grep("enthic", ID)] <- "benthic"
# Read SNP names from PLINK bim file
snp <- read.table("global.chrXXI.snp.Rfiltered.bim")
names(snp) <- c("chr", "SNP", "cM", "position", "ALT", "REF")
# PCA with 10 components saved
bed <- read.pcadapt("global.chrXXI.snp.Rfiltered.bed", type = "bed")
PCA <- pcadapt(input = bed, K = 10, min.maf = 0.01)
names(PCA)
# [1] "scores" "singular.values" "loadings" "zscores"
# [5] "af" "maf" "chi2.stat" "stat"
# [9] "gif" "pvalues" "pass"
# Obtain PCA loadings (NA means SNP was an invariant or otherwise dropped)
pca.loadings <- as.data.frame(PCA$loadings, row.names = snp$SNP)
names(pca.loadings) = paste0("PC", 1:ncol(PCA$loadings))
# Obtain PCA scores
pca.scores <- as.data.frame(PCA$scores, row.names = ID)
colnames(pca.scores) <- paste0("PC", 1:ncol(pca.scores))
# scree plot of eigenvalues
plot(PCA, option = "screeplot")
# plot of PC scores
plot(PCA, option = "scores", pop = ecotype)
plot(PCA, option = "scores", pop = lake)
plot(PCA, option = "scores", i = 3, j = 4, pop = lake)
This section shows methods to compare populations or species in allele frequencies and other metrics for a fixed block or window of the genome, emphasizing methods available in R.
We begin by specifying the range of bases to be included in the window and then extract the corresponding genotypes from a VCF file.
library(VariantAnnotation)
# Choose which chromosome
chrname <- "chrXXI"
# Choose the SNP vcf file
vcfFileName <- "global.chrXXI.snp.Rfiltered.vcf.gz"
# Choose which populations to compare
# Here I'm choosing two groups from the Little Campbell River
sampleNames <- scanBcfHeader(file = vcfFileName)[[1]]$Sample
keepSamples <- sampleNames[grep("campbell", sampleNames, ignore.case = TRUE)]
head(keepSamples)
# Set start and end base pairs of window (here, a 10K base window)
range <- c(2000001, 2010000)
# Read the vcf block
param <- ScanVcfParam(fixed = NA, geno = c("GT"), info = NA,
samples = keepSamples,
which = GRanges(chrname, IRanges(start = range[1], end = range[2])))
vcf <- readVcf(file = vcfFileName, genome = "stickleback_v5_assembly.fa.gz",
param = param)
# Extract genotypes and transpose
GT <- as.data.frame(t(geno(vcf)$GT))
rm(vcf)
# Make a unique label for the populations
# Here, the first 6 letters of the names are enough
pop <- substr(rownames(GT), 1, 6)
popfreq <- table(pop)
popfreq
# Marine Stream
# 69 9
# Identify SNPs that are invariants in this sample
is.invariant <- sapply(GT, function(x){
x <- gsub("\\||\\/", "", x)
length(unique(x[x != ".."])) == 1
})
table(is.invariant)
# Drop the invariant sites
GT <- GT[, !is.invariant]
# Replace missing genotype symbol with NA
GT[GT == "./."] <- NA
# Keep only SNPs with >= half individuals genotyped in each pop
ngeno <- aggregate(GT, by = list(pop), FUN = function(x){sum(!is.na(x))})[,-1]
test <- apply(ngeno, 2, function(x){ all(x > 0.5 * popfreq)})
GT <- GT[, test]
This code calculates the number and frequency of alleles in each group in the window. This code is for biallelic SNPs only. To begin, run the commands above to obtain pop and GT for a predetermined window of bases, dropping the invariant sites, replacing missing genotype symbols with NA, and dropping SNPs with too few genotyped individuals per population.
library(stringr)
# Calculate number of REF alleles in each genotype, by population
nref <- aggregate(GT, by = list(pop),
FUN = function(GT){sum(str_count(GT, pattern = "0"), na.rm = TRUE)})
rownames(nref) <- nref[,1]
# Calculate number of ALT alleles in each genotype, by population
nalt <- aggregate(GT, by = list(pop),
FUN = function(GT){sum(str_count(GT, pattern = "1"), na.rm = TRUE)})
rownames(nalt) <- nalt[,1]
# Calculate ALT allele frequency, by population
alleleFreq <- nalt[,-1]/(nref[,-1] + nalt[,-1])
# Mean absolute allele frequency difference in window
alleleFreqDiffMean <- mean(unlist((abs(alleleFreq[1,] -
alleleFreq[2,]))), na.rm = TRUE)
alleleFreqDiffMean
The Cluster Separation Score (CSS) was introduced by Jones et al 2012; Nature as a window-based metric of divergence between two populations. It measures the genetic difference between all pairs of individuals as the euclidean distance in two dimensions obtained from an MDS analysis of proportion sequence divergence. The CSS score is the average distance between pairs of individuals belonging to different populations minus the average distance between pairs belonging to the same population.
The following code is a slight variation of the original method in
Jones et al. To begin, run the commands above to obtain pop
and GT
for a predetermined window of bases, dropping the
invariant sites, replacing missing genotype symbols with NA, and
dropping SNPs with too few genotyped individuals per population.
CSS is a measure of divergence between groups and it grows with the size of the window and with increasing number of snps. To make the measurement comparable between different windows, divide CSS by the total number of called bases in each window of predetermined quality (including snps, indels, and invariants).
# Sort the rows of GT
GT <- GT[order(pop), ]
pop <- pop[order(pop)]
poptype <- sort(unique(pop))
if(length(poptype) != 2) stop("CSS only works for two population types")
# Count number of ALT alleles per genotype (snps become rows)
genotypes <- apply(GT, 1, str_count, pattern = "1")
# List all possible pairs of individuals
colpairs <- combn(1:ncol(genotypes), 2)
# List pops of all possible pairs of individuals
gtpairs <- combn(pop, 2, FUN = function(x){paste(x, collapse=",")})
# Identify within and between-population pairs in pairwise combinations
ii <- gtpairs == paste(poptype[1], poptype[1], sep = ",") # cases of "1,1"
jj <- gtpairs == paste(poptype[2], poptype[2], sep = ",") # cases of "2,2"
ij <- gtpairs == paste(poptype[1], poptype[2], sep = ",") # cases of "1,2"
ji <- gtpairs == paste(poptype[2], poptype[1], sep = ",") # cases of "2,1"
m <- sum(pop == poptype[1])
n <- sum(pop == poptype[2])
# Calculate a measure of proportion sequence divergence between
# all pairs of individuals at every snp:
# 0 = same genotype, 1 = different genotype, 0.5 = a shared allele
psd <- apply(colpairs, 2, function(x){
psd <- abs(genotypes[ , x[1]] - genotypes[ , x[2]])/2
})
colnames(psd) <- gtpairs
# What to do with a missing value for psd?
# Here, I replace missing psd with the mean psd of its poptype comparison:
# The mean of psd between individuals belonging to poptype 1, or
# the mean of psd between individuals belonging to poptype 2, or
# the mean of psd between individuals belonging to different poptypes.
# Assign the average psd's to cases of psd = NA
psdUnique <- unique(colnames(psd))
for(p in psdUnique){
doCols <- c(grep(p, colnames(psd)))
rowMean <- apply(psd[ , doCols], 1, mean, na.rm=TRUE)
for(k in doCols){
psd[is.na(psd[ , k]) , k] <- rowMean[is.na(psd[ , k])]
}
}
# Pairwise differences summed across the window
psdSum <- colSums(psd)
# Initialize distance matrix
d <- matrix(0, nrow = length(pop), ncol = length(pop))
# Place pairwise distances into distance matrix
d[lower.tri(d,diag = FALSE)] <- psdSum
d <- t(d)
d[lower.tri(d,diag = FALSE)] <- psdSum
rownames(d) <- pop
colnames(d) <- pop
# Obtain first 2 MDS axes
z <- cmdscale(d, k = 2, add = TRUE)$points
# Matrix of pairwise distance between individuals on MDS axes
euclid <- as.matrix(dist(z, method = "euclidean"))
euclid.lower <- euclid[lower.tri(euclid)]
# Sum of differences within and between groups
sii <- sum(euclid.lower[ii])
sjj <- sum(euclid.lower[jj])
sij <- sum( sum(euclid.lower[ij]), sum(euclid.lower[ji]), na.rm=TRUE )
# CSS score
CSS <- ( sij/(m*n) ) - (1/(m+n))*( sii/((m-1)/2) + sjj/((n-1)/2) )
CSS
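As described above, CSS grows with window size and SNP number, so for comparisons among windows divide by the number of called bases of predetermined quality in the window. The count must come from your own data; the value below is a hypothetical placeholder:
# Standardize CSS by the number of called bases in the window
nCalledBases <- 10000  # hypothetical count; obtain from your own data
CSS.perBase <- CSS / nCalledBases
CSS.perBase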
Here’s how we calculate Fst (method of Weir and Cockerham 1984) using
the hierfstat
package. The advantage of this package is
that it provides all the variance components needed to calculate Fst.
The calculation is not fast, however. The fst_WC84()
function in the assigner
package is said to be much faster.
Using it instead requires additional steps and packages to get the data
in the right format (not shown here).
Fst is a ratio of the variance among groups divided by the total variance. To get Fst for a window, calculate it as the ratio of the sums of the variance components across SNPs (not the average of the Fst’s). Use the same approach to calculate Fst across multiple windows and the whole genome.
To begin, run the commands above to obtain pop
and
GT
for a predetermined window of bases, dropping the
invariant sites, replacing missing genotype symbols with NA, and
dropping SNPs with too few genotyped individuals per population. The
code below assumes biallelic SNPs only.
library(hierfstat)
# Convert genotypes to hierfstat format - code assumes biallelic SNPs only
genotypes <- GT
genotypes[genotypes == "0/0" | genotypes == "0|0"] <- 11
genotypes[genotypes == "0/1" | genotypes == "1/0"] <- 12
genotypes[genotypes == "0|1" | genotypes == "1|0"] <- 12
genotypes[genotypes == "1/1" | genotypes == "1|1"] <- 22
# Convert character data to numeric
# (note: rownames are lost in the process)
genotypes <- data.frame(lapply(genotypes, as.integer))
# Group ID also needs to be numeric
group <- as.integer(as.factor(pop))
# Weir-Cockerham Fst
z <- wc(cbind(group, genotypes))
# Contents of Fst object
names(z)
# [1] "call" "sigma" "sigma.loc" "per.al" "per.loc" "FST" "FIS"
# Window Fst
z
# $FST
# [1] 0.8322499
# $FIS
# [1] 0.1918404
# z$sigma.loc contains the variance components per snp
head(z$sigma.loc)
# lsiga lsigb lsigw
# chrXXI.2000027_G.T -0.0003637520 7.696227e-05 0.01315789
# chrXXI.2000036_C.T -0.0003732924 8.085008e-05 0.01351351
# chrXXI.2000061_G.A -0.0004972829 8.852259e-05 0.01388889
# chrXXI.2000066_A.G 0.0021377646 -5.930962e-04 0.02666667
# chrXXI.2000100_G.A -0.0004333496 8.680556e-05 0.01388889
# chrXXI.2000108_G.T 0.0795661414 1.245831e-02 0.02739726
# Sum the variance components across SNPs in the window
varCompSum <- colSums(z$sigma.loc)
varCompSum
# lsiga lsigb lsigw
# 57.698847 2.231086 9.398820
# Window Fst
varCompSum["lsiga"] / sum(varCompSum)
# [1] 0.8322499
The numerator of Fst (calculated in the previous section) is a measure of absolute divergence between populations, similar to dxy. For a window-based measure of divergence, divide varCompSum["lsiga"] by the total number of called bases in each window of predetermined quality (including SNPs, indels, and invariants).
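Continuing from the hierfstat output above, a sketch of that calculation (again, nCalledBases is a hypothetical placeholder you must supply from your own tally of callable sites):
# Window-based measure of absolute divergence, akin to dxy
nCalledBases <- 10000  # hypothetical count of called bases in the window
varCompSum["lsiga"] / nCalledBases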
Instructions here use Beagle to impute missing genotypes using identity by descent.
To begin, make sure to save only biallelic genotypes to a VCF file in which missing genotypes are coded as “./.” and not “0/0”.
It would be wise to reduce the VCF file to a subset of SNPs successfully genotyped in a high fraction of individuals. The commands below keep SNPs genotyped in at least 75% of individuals.
vcftools --gzvcf global.chrXXI.snp.Rfiltered.vcf.gz \
--max-missing 0.75 --recode --recode-INFO-all \
--out global.chrXXI.snp.Rfiltered75
bgzip global.chrXXI.snp.Rfiltered75.recode.vcf
mv global.chrXXI.snp.Rfiltered75.recode.vcf.gz global.chrXXI.snp.Rfiltered75.vcf.gz
tabix -p vcf global.chrXXI.snp.Rfiltered75.vcf.gz
Now, impute the missing genotypes. The commands below for genotype imputation are those used on a Compute Canada (Digital Alliance) machine. Don’t forget to load the module first. To speed up the calculations I requested 16 CPUs with 2G of memory per CPU (32G in total).
The following command imputes genotypes using the existing genotypes. The output is a VCF file with imputed genotypes.
java -Xmx31g -jar ${EBROOTBEAGLE}/beagle.22Jul22.46e.jar nthreads=16 \
gt=global.chrXXI.snp.Rfiltered75.vcf.gz out=global.chrXXI.snp.Rfiltered75.imputed
gatk IndexFeatureFile -I global.chrXXI.snp.Rfiltered75.imputed.vcf.gz
Beagle will also impute genotypes using the genotype likelihoods
instead. However, you’ll first need to convert any missing likelihoods.
The zcat
sequence of commands below replaces missing
likelihoods with “0,0,0”.
zcat global.chrXXI.snp.Rfiltered75.vcf.gz | sed 's/\.,\.,\./0,0,0/g' | \
bgzip > global.chrXXI.snp.edited75.vcf.gz
gatk IndexFeatureFile -I global.chrXXI.snp.edited75.vcf.gz
java -Xmx31g -jar ${EBROOTBEAGLE}/beagle.22Jul22.46e.jar nthreads=16 \
gt=global.chrXXI.snp.edited75.vcf.gz out=global.chrXXI.snp.Rfiltered75.imputed
gatk IndexFeatureFile -I global.chrXXI.snp.Rfiltered75.imputed.vcf.gz
The rest of this page lists the steps that we use to generate a VCF
file starting with fastq
files and using a pipeline based
on GATK4 best practices.
For stickleback we use the v5 reference genome, a Paxton Lake
benthic, available at Mike White’s genome browser
site. The downloaded file stickleback_v5_assembly.fa.gz
is a gzipped fasta file containing the v5 assembly of the stickleback
genome. It includes sequence for 24 components: 20 autosomes, the X
chromosome (chrXIX), the Y chromosome (chrY), the mitochondrion (chrM),
plus “chrUn”, which is an artificial chromosome consisting of
unassembled contigs chained together (individual contigs are separated
by a large number of NNNNNNNNNs). To peek at the fasta file, use the
unix command
zless stickleback_v5_assembly.fa.gz
The reference genome file needs to be indexed before use. Unzip the fasta file. Here’s the command for stickleback.
gunzip stickleback_v5_assembly.fa.gz
To index on Compute Canada:
# Generate bwa index
bwa index -a bwtsw stickleback_v5_assembly.fa
# Generate .fai index files needed by GATK
samtools faidx stickleback_v5_assembly.fa
# Generate .dict file needed by GATK
gatk --java-options "-Xmx2g" CreateSequenceDictionary \
-R stickleback_v5_assembly.fa
R can be used to examine number and quality of short reads.
Paired short reads for an individual are in two large text files in
.fastq.gz
format. For paired reads, the first file (file
name with “R1” or just “1”) will have the forward reads, and the second
file (file name with “R2” or “2”) will contain reverse complement
reads.
The fastq.gz
files are generally huge, so the R code
below takes a large random sample of reads instead (the
ShortRead
manual says the default number of reads is 1
million, but I don’t think so).
In R:
library(ShortRead)
# choose a fastq file to inspect
myFileName <- dir(getwd(), "fishName.R1.fastq.gz")
# Take large random sample of reads
reads <- qa(myFileName)
# Show total number of reads read
reads[["readCounts"]]
# Show base frequencies
reads[["baseCalls"]]
# Show mean and median base quality
z <- rep(as.integer(rownames(reads[["baseQuality"]])),
reads[["baseQuality"]]$count)
mean(z, na.rm = TRUE)
median(z, na.rm = TRUE)
# Plot read quality by cycle
perCycle <- reads[["perCycle"]]
ShortRead:::.plotCycleQuality(perCycle$quality)
# Create summary report
report(reads, dest = paste(myFileName, "QAreport", sep = "."), type = "html")
The report will be placed in a folder named after the
fastq.gz
files analyzed. Download the folder to your local
machine and double click the index.html
file inside the
folder to view all the plots in your browser and print to a pdf.
We use bwa
to
align paired reads to the reference genome. The alignments are saved in
uncompressed .sam
files.
bwa command
The basic bwa command is as follows (the backslash continues a single command on the next line).
On Compute Canada:
bwa mem -M stickleback_v5_assembly.fa fishName_R1.fastq.gz \
fishName_R2.fastq.gz > fishName.sam
We can speed up this step. A single bwa
job can be
divided up into multiple threads that are run concurrently on separate
cores. Here, the example command below divides the process among 16
cores, speeding up the whole operation many fold. Most nodes on Compute
Canada machines have 32 cores, so you can go up to 32 threads.
Here’s the basic command if running on Compute Canada in interactive mode. The salloc command requests 16 cores (CPUs) and 1G of memory per core.
# in unix
salloc --time=3:0:0 --ntasks-per-node=16 --mem-per-cpu=1G --account=def-schluter
# Wait for the resources to be made available to you.
# Then type the following (substituting the name of your actual fish)
bwa mem -M -t 16 stickleback_v5_assembly.fa fishName_R1.fastq.gz \
fishName_R2.fastq.gz > fishName.sam
exit # end interactive session
On the ZCU Cluster, no advance request for resources is required. The command here is
/Linux/bin/bwa-0.7.15 mem -M -t 16 stickleback_v5_assembly.fa fishName_R1.fastq.gz \
fishName_R2.fastq.gz > fishName.sam
Here we take each sam file of aligned reads through the bulk of the GATK4 procedures. We use the steps recommended for “germline short variant discovery (snps + Indels)”. See the tool documentation for the version of GATK you are using.
The output of this sequence of steps is a VCF file containing all the joint genotype calls.
SNP calling involves the following steps.
1. SortSam to sort the reads in the sam file.
2. MarkDuplicates to identify duplicate reads (drop this step if doing GBS).
3. AddOrReplaceReadGroups to assign reads to a single new read group.
4. BaseRecalibrator to make a recalibration table for base quality scores (optional).
5. ApplyBQSR to recalibrate the base quality scores (optional).
6. HaplotypeCaller to call SNPs for individual specimens.
7. GenomicsDBImport to construct a database of gVCF files.
8. GenotypeGVCFs to genotype all samples jointly.
Base recalibration (steps 4 and 5) is optional, but we use it in the standard stickleback pipeline. You’ll need to ensure that the following files are in your working directory.
stickleback_21genome_snp_v5.1.bed
stickleback_21genome_snp_v5.1.bed.idx
I also use this shorter list of known SNPs.
knownSnpsv5.1.vcf
knownSnpsv5.1.vcf.idx
These files contain lists of known SNPs in the stickleback genome. The file stickleback_21genome_snp_v5.1.bed lists the highest-confidence SNPs discovered in the study of 21 stickleback genomes sampled from marine and freshwater sites around the northern hemisphere by Jones et al 2012; Nature. The original file was provided by Felicity Jones and lifted over to the v5 genome coordinates. The second list of SNPs was discovered by Jones et al. 2012; Current Biology, obtained from the Supplement of that paper and from NCBI where the data are deposited. They were lifted over to the new v5 coordinates as described in the section above on converting genome coordinates.
Here’s the sequence of commands applied to a single fish whose sam file name is GrowthHormoneTransgenicLCR_A-line.GHT-A1.sam (abbreviated FISHNAME in the commands below). I had to go up to 24G of memory to avoid crashes. The flag -Xmx22g specifies the maximum amount of memory that can be used. Set the number equal to the amount of memory requested for the job minus a little for overhead.
Here’s what the commands would look like in an interactive session on Compute Canada.
# Request resources for interactive session
salloc --time=12:0:0 --ntasks-per-node=1 --mem-per-cpu=24G --account=def-schluter
# You'll have to wait for the resources to become available to continue
gatk --java-options "-Xms256m -Xmx22g -XX:ParallelGCThreads=8" SortSam -I FISHNAME.sam \
-O FISHNAME.sorted.bam \
-SO coordinate --CREATE_INDEX TRUE
gatk --java-options "-Xms256m -Xmx30g -XX:ParallelGCThreads=8" MarkDuplicates \
-I FISHNAME.sorted.bam \
-O FISHNAME.mkdup.bam \
-M FISHNAME.mkdup.metrics \
--REMOVE_DUPLICATES FALSE --ASSUME_SORT_ORDER coordinate
gatk --java-options "-Xms256m -Xmx4g" AddOrReplaceReadGroups \
-ID GenomeBC.FISHNAME -LB FISHNAME.SB -SM FISHNAME -PL ILLUMINA \
-PU GenomeBC -I FISHNAME.mkdup.bam -O FISHNAME.rg.bam \
-SO coordinate --CREATE_INDEX TRUE
# Optional base quality score recalibration:
gatk --java-options "-Xms256m -Xmx4g" BaseRecalibrator \
-R stickleback_v5_assembly.fa -I FISHNAME.rg.bam \
--known-sites knownSnpsv5.1.vcf \
--known-sites stickleback_21genome_snp_v5.1.bed \
-O FISHNAME.recal.table
gatk --java-options "-Xms256m -Xmx4g" ApplyBQSR -R stickleback_v5_assembly.fa \
-I FISHNAME.rg.bam \
--bqsr-recal-file FISHNAME.recal.table \
-O FISHNAME.recal.bam
exit # end interactive session
The bam files created in the last step contain a lot of information about read pair numbers and coverage along chromosomes. Here’s how to use R to extract some of this information.
The bam
files can be huge. The code below examines one
chromosome at a time, using chrXXI as an example. You might need to run
this code in an interactive session if the memory demands are too
great.
library(Rsamtools)
library(bitops)
# select the bamfile and chromosome to use.
bamfile <- "myBamFile.bam"
chr <- "chrXXI"
baifile <- sub("bam$", "bai", bamfile) # name of bam index file
# obtain chromosome length from genome index
faifile <- read.table("stickleback_v5_assembly.fa.fai", row.names = 1)
chrlength <- faifile[chr,2]
# read fields from the bam file
x <- scanBam(bamfile, index = baifile,
  param = ScanBamParam(what = c("qname", "pos", "qwidth", "flag", "strand", "isize"),
  which = GRanges(seqnames = chr, IRanges(1, chrlength)) ))[[1]]
# Drop unaligned reads
ind <- !is.na(x[["pos"]])
x <- lapply(x, function(x){x[ind]})
# Calculate coverage of mapped reads
ranges1 <- IRanges(start = x$pos, width = x$qwidth,
names=make.names(x[["qname"]], unique = TRUE))
cover <- as.vector(coverage(ranges1))
mean(cover) # mean coverage over all bases
median(cover) # median coverage
# Plot the frequency distribution of coverage.
hist(cover[cover <= 100], right = FALSE, breaks = 50,
main="Coverage", col="firebrick")
# Number of mapped reads.
x[["mapped"]] <- bitAnd(x[["flag"]], 0x0004) != 0x0004
x[["mappedmate"]] <- bitAnd(x[["flag"]], 0x0008) != 0x0008
x[["mapped.in.properpair"]] <- bitAnd(x[["flag"]], 0x0002) == 0x0002
sum(x[["mapped"]]) # No. mapped reads
table(x[["strand"]]) # No. mapped reads on "+" and "-" strands
sum(x[["mapped"]] & x[["mappedmate"]]) # No. reads whose mate also mapped
sum(x[["mapped.in.properpair"]]) # No. reads mapped in a proper pair
# Plot of position of mapped reads along the chromosome
# (the 1-based leftmost position of each query on the reference sequence)
hist(x$pos/10^6, right=FALSE, col="firebrick",
breaks = 200, xlab = "Position (million bases)")
# Alignment sequence (query) width.
median(x[["qwidth"]], na.rm = TRUE)
# Plot of "insert" or "template" length, the number of bases
# from the leftmost mapped base to the rightmost mapped base.
# The leftmost segment has a plus sign and the rightmost has a minus sign.
hist(x$isize[abs(x$isize) <= 700], right = FALSE,
col = "firebrick", breaks = 100)
This step uses GATK’s HaplotypeCaller
to call SNPs for
an individual, separately for each chromosome. Commands below use chrXXI
as an example. Each chromosome and individual yields a compressed gVCF
file. This step can require significant resources for large chromosomes
and probably needs to be run in batch mode.
gatk --java-options "-Xmx4g" HaplotypeCaller \
-R stickleback_v5_assembly.fa -ERC GVCF \
-I FISHID.recal.bam \
-O FISHID.chrXXI.g.vcf.gz \
-L chrXXI \
--pcr-indel-model NONE \
--heterozygosity 0.01 --indel-heterozygosity 0.00125 \
--output-mode EMIT_ALL_ACTIVE_SITES \
-A DepthPerAlleleBySample -A FisherStrand -A InbreedingCoeff \
-A MappingQuality -A MappingQualityRankSumTest \
-A QualByDepth -A ReadPosRankSumTest
Next, create the database of gVCF files. The database for chrXXI will
be located in a folder named chrXXIdb
. If this folder
already exists, delete it before running.
The task requires a sample map, a tab-delimited text file with two
columns. The first column contains the sample names (fish id’s) line by
line. The second column contains the gVCF file names. Here are the first
few lines of an example file named chrXXI.map
to analyze
the chromosome chrXXI.
Marine-Pac-SaritaBay_BC.SAR-29 Marine-Pac-SaritaBay_BC.SAR-29.chrXXI.g.vcf.gz
Solitary-CranbyLake.CRL25 Solitary-CranbyLake.CRL25.chrXXI.g.vcf.gz
Marine-Pac-MudLake_AK.ML-39 Marine-Pac-MudLake_AK.ML-39.chrXXI.g.vcf.gz
Solitary-TroutLake.TRL-2 Solitary-TroutLake.TRL-2.chrXXI.g.vcf.gz
Marine-Pac-DoranPark_CA.CA02-29 Marine-Pac-DoranPark_CA.CA02-29.chrXXI.g.vcf.gz
Marine-Pac-MudLake_AK.ML-25 Marine-Pac-MudLake_AK.ML-25.chrXXI.g.vcf.gz
...
The following R commands can help to make the map file if your file names have the format FISHID.chrXXI.g.vcf.gz.
# Grab all files whose names end with "chrXXI.g.vcf.gz"
z <- list.files(pattern = "chrXXI.g.vcf.gz$")
# This works if the rest of the file name is the name of the individual
fishname <- sub(".chrXXI.g.vcf.gz", "", z)
# make the map file
chrXXImap <- data.frame(fishname, z)
write.table(chrXXImap, file = "chrXXI.map", quote = FALSE,
col.names = FALSE, row.names = FALSE, sep = "\t")
Now construct the database with GATK.
# in unix
gatk --java-options "-Xmx4g" GenomicsDBImport \
--genomicsdb-workspace-path chrXXIdb \
--sample-name-map chrXXI.map \
-L chrXXI
exit
Finally, use GenotypeGVCFs to call SNPs and invariant sites jointly on the samples in your database. Here is code for Compute Canada to run it on chrXXI using the database chrXXIdb created in the previous step. In this example, we include invariant as well as variant sites. The output is sent to a VCF file named global.chrXXI.vcf.gz.
NOTE: GATK used to represent uncalled (missing) genotypes as "./." in the VCF output of GenotypeGVCFs. But as of GATK 4.2.3.0, missing genotypes might appear as "0/0" instead (except in some cases with phased variants). GATK wants us to know that "You'll still be able to determine the genotype as missing because the FORMAT DP field will be 0, though." Later, I use a genotype quality metric to distinguish missing from non-missing genotypes.
# in unix
gatk --java-options "-Xmx4g" GenotypeGVCFs \
-R stickleback_v5_assembly.fa \
-V gendb://chrXXIdb \
-L chrXXI \
--include-non-variant-sites \
--max-alternate-alleles 3 \
--standard-min-confidence-threshold-for-calling 20 \
-O global.chrXXI.vcf.gz
This section describes several tools to extract SNPs, indels, and invariant sites from a VCF file and filter them on the basis of quality. It also describes ways to extract a subset of samples from VCF files and drop ALT alleles unused in the subset.
The following command extracts only SNPs (single nucleotide polymorphisms) and saves the results in a new VCF file. SNPs that failed the phred-scaled confidence threshold for calling variants, as determined by the --standard-min-confidence-threshold-for-calling argument in the GenotypeGVCFs command (I used 20; the default is 30), are given a LowQual value in the FILTER column and are removed by the --exclude-filtered argument in the SelectVariants command below.
Note that in GATK, SNPs include single nucleotide substitutions as well as spanning deletions. The ALT allele for a spanning deletion is indicated as * in the VCF file. Most are labeled LowQual and filtered out. Not all downstream programs will accept spanning deletions.
gatk --java-options "-Xmx2g" SelectVariants \
-V global.chrXXI.vcf.gz \
--select-type-to-include SNP \
--exclude-filtered \
-O global.chrXXI.snp.vcf.gz
The following command extracts only indels (insertions and deletions) and saves the results in a new VCF file. Indels that failed the phred-scaled confidence threshold for calling variants are given a LowQual value in the FILTER column and are removed by the --exclude-filtered argument in the command below.
gatk --java-options "-Xmx2g" SelectVariants \
-V global.chrXXI.vcf.gz \
--select-type-to-include INDEL \
--exclude-filtered \
-O global.chrXXI.indel.vcf.gz
The VCF output of this command includes only invariant sites, i.e., neither SNPs nor indels.
Add the argument --exclude-filtered to the SelectVariants command below (as in the commands above to extract SNPs or indels) to leave out sites having LowQual in the FILTER column. These are sites that failed the minimum threshold for calling a SNP and will have '.' or '*' in the ALT column. Here, I did not include this quality filter and instead prefer to filter invariant genotypes according to genotype quality at a later stage in the analysis.
gatk --java-options "-Xmx2g" SelectVariants \
-V global.chrXXI.vcf.gz \
--select-type-to-include NO_VARIATION \
-O global.chrXXI.invar.vcf.gz
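As a quick sanity check (my addition, not part of the pipeline), you can count the number of data lines in each of the three output files:
# Count data lines (non-header lines) in each extracted VCF file
for f in global.chrXXI.snp.vcf.gz global.chrXXI.indel.vcf.gz \
global.chrXXI.invar.vcf.gz; do
echo "$f:"
zcat "$f" | grep -v '^#' | wc -l
done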
Before filtering out low-quality SNPs based on annotations provided in the VCF file, it is a good idea to plot the relevant metrics to get a sense of whether the recommended filtering cutoffs make sense for your data. To do this you’ll need to read some of the contents of the VCF file into R.
A large random sample of SNPs should suffice. If you really want to read the whole VCF file and are running into memory limits (most likely if you are also reading genotype fields such as GT and GQ), read the VCF file in chunks, as explained here.
The following code (obtained from VCF-tricks) samples 10,000 SNPs (modify by editing the number in 'shuf -n 10000'). It assumes that the number of header lines (leading lines beginning with "#" in the VCF file) is less than 2000 (modify by editing the number in 'head -n 2000').
# Number of SNPs in VCF file (number of lines not beginning with '#')
zcat global.chrXXI.snp.vcf.gz | zgrep -v '^#' | wc -l
# Which is the first non-header line? (prints line number and a few characters)
zgrep -v -m1 -n '^#' global.chrXXI.snp.vcf.gz | head -c 20
# Take a random sample of 10,000 SNPs
(zcat global.chrXXI.snp.vcf.gz | head -n 2000 | \
zgrep '^#' ; zgrep -v '^#' global.chrXXI.snp.vcf.gz | \
shuf -n 10000 | LC_ALL=C sort -k1,1V -k2,2n) | \
bgzip > global.chrXXI.snp.10K.vcf.gz
tabix -p vcf global.chrXXI.snp.10K.vcf.gz
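To confirm that the sampled file contains the expected number of SNPs (10,000, unless the source file had fewer):
# Should print 10000
zcat global.chrXXI.snp.10K.vcf.gz | grep -v '^#' | wc -l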
Now, choose the quality metrics of interest. For example, here I select the total read depth (DP) of each SNP, the Quality by Depth (QD) score, and the Fisher's strand bias (FS) score. Remove the 'info =' argument entirely to read all available INFO fields. To save memory, leave out the genotype fields (geno = NA).
# in R:
library(VariantAnnotation)
# Choose which VCF fields to read
param <- ScanVcfParam(fixed = c("ALT"), geno = NA,
info = c("DP", "QD", "FS"))
Read the specified fields into a vcf object in R.
# Read VCF file into R
vcf <- readVcf(file = "global.chrXXI.snp.10K.vcf.gz",
genome = "stickleback_v5_assembly.fa.gz",
param = param)
# View SNPs
rowRanges(vcf)
# View INFO field variables read
info(vcf)
# Histogram of DP
hist(info(vcf)$DP, right = FALSE, breaks = 40, col = "firebrick")
# Histogram of QD
hist(info(vcf)$QD, right = FALSE, breaks = 40, col = "firebrick")
# Histogram of FS
hist(info(vcf)$FS, right = FALSE, breaks = 40, col = "firebrick")
Set geno = c("GQ") when reading the random sample of SNPs to view the distribution of genotype quality scores (GQ).
param <- ScanVcfParam(fixed = NA, geno = c("GQ"), info = NA)
# Read VCF file into R
vcf <- readVcf(file = "global.chrXXI.snp.10K.vcf.gz",
genome = "stickleback_v5_assembly.fa.gz",
param = param)
GQ <- as.vector(geno(vcf)$GQ)
hist(GQ, right = FALSE, breaks = 40, col = "firebrick")
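Before committing to a GQ cutoff, it can help to check what fraction of genotype calls the cutoff would set to missing. A small check of my own, using the GQ vector just created (I use a cutoff of 10 in the filtering step later):
# Fraction of genotype calls that would be recoded as missing
mean(GQ < 10, na.rm = TRUE)
# Compare several candidate cutoffs at once
sapply(c(5, 10, 20), function(cutoff) mean(GQ < cutoff, na.rm = TRUE))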
We largely follow GATK's recommendations on thresholds for hard filtering SNPs and indels according to quality metrics contained in the VCF file, but we confirmed or modified the thresholds based on plots of the data (see previous section).
The following VariantFiltration command hard-filters the SNPs. SNPs failing a filter are not removed yet. Instead, the FILTER field in the output file is annotated as PASS if the SNP passes all filters. If a SNP fails a filter, the FILTER field is annotated with the name of the filter it failed.
In the version of GATK used here, I got many warnings concerning MQRankSum and ReadPosRankSum, but online forums about this say that the warnings should be ignored.
If a SNP is missing a quality annotation (e.g., no MQ value is given for a given SNP), then by default it passes that test (change this by setting the --missing-values-evaluate-as-failing argument to true).
The subsequent SelectVariants command removes the SNPs failing the hard-filter step.
gatk --java-options "-Xmx4g" VariantFiltration \
-V global.chrXXI.snp.vcf.gz \
-filter "QD < 2.0" --filter-name "QD2" \
-filter "FS > 60.0" --filter-name "FS60" \
-filter "SOR > 3.0" --filter-name "SOR3" \
-filter "MQ < 40.0" --filter-name "MQ40" \
-filter "MQRankSum < -4.0" --filter-name "MQRankSum-4" \
-filter "ReadPosRankSum < -4.0" --filter-name "ReadPosRankSum-4" \
-filter "ExcessHet > 54.69" --filter-name "ExcessHet5469" \
-O global.chrXXI.snp.hf.vcf.gz
gatk --java-options "-Xmx4g" SelectVariants \
-V global.chrXXI.snp.hf.vcf.gz \
--exclude-filtered \
-O global.chrXXI.snp.filtered.vcf.gz
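To see how many SNPs failed each filter, you can tally the FILTER column (column 7 in a standard VCF) of the hard-filtered file before the failing SNPs are removed. This check is my addition; SNPs failing more than one filter appear with semicolon-separated filter names.
# Tally values in the FILTER column of the hard-filtered VCF file
zcat global.chrXXI.snp.hf.vcf.gz | grep -v '^#' | cut -f7 | sort | uniq -c | sort -rn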
We used the following code to hard filter indels:
gatk --java-options "-Xmx2g" VariantFiltration \
-V global.chrXXI.indel.vcf.gz \
-filter "QD < 2.0" --filter-name "QD2" \
-filter "QUAL < 20.0" --filter-name "Q20" \
-filter "FS > 60.0" --filter-name "FS60" \
-filter "ReadPosRankSum < -4.0" \--filter-name "ReadPosRankSum-4" \
-filter "ExcessHet > 54.69" --filter-name "ExcessHet5469" \
-O global.chrXXI.indel.hf.vcf.gz
# Now, remove indels that failed the filters
gatk --java-options "-Xmx2g" SelectVariants \
-V global.chrXXI.indel.hf.vcf.gz \
--exclude-filtered \
-O global.chrXXI.indel.filtered.vcf.gz
Recent versions of GATK indicate missing genotypes as "0/0", i.e., the homozygous reference genotype, instead of the missing symbol "./.". We need to recode to distinguish missing genotypes from true reference homozygotes.
Filtering SNPs and indels according to genotype quality (GQ) accomplishes this, converting missing and other poor-quality genotypes to "./.". Here I recode as missing ("./.") any genotype in the SNP VCF file whose genotype quality (GQ) is less than 10.
After hard-filtering genotypes, some ALT alleles at a SNP might no longer occur in any genotype present in the file. The additional SelectVariants command will clear them out. Some SNPs might become invariant; such sites will still be present in the output VCF file, but will be indicated by a "." in the ALT field.
vcftools --gzvcf global.chrXXI.snp.filtered.vcf.gz \
--minGQ 10 --recode --recode-INFO-all \
--out global.chrXXI.snp.filtered
# Compress and index (bgzip replaces the .vcf file with a .vcf.gz file)
bgzip global.chrXXI.snp.filtered.recode.vcf
tabix -p vcf global.chrXXI.snp.filtered.recode.vcf.gz
gatk --java-options "-Xmx4g" SelectVariants \
-V global.chrXXI.snp.filtered.recode.vcf.gz \
--remove-unused-alternates \
-O global.chrXXI.snp.GQfiltered.vcf.gz
Here is the same procedure applied to indels.
vcftools --gzvcf global.chrXXI.indel.filtered.vcf.gz \
--minGQ 10 --recode --recode-INFO-all \
--out global.chrXXI.indel.filtered
bgzip global.chrXXI.indel.filtered.recode.vcf
tabix -p vcf global.chrXXI.indel.filtered.recode.vcf.gz
gatk --java-options "-Xmx4g" SelectVariants \
-V global.chrXXI.indel.filtered.recode.vcf.gz \
--remove-unused-alternates \
-O global.chrXXI.indel.GQfiltered.vcf.gz
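To check how much missing data the GQ filter introduced, vcftools can tabulate the proportion of missing genotypes per individual (this check is my addition, shown here for the SNP file). The results are written to a file with the suffix .imiss.
# Per-individual missingness after GQ filtering
vcftools --gzvcf global.chrXXI.snp.GQfiltered.vcf.gz \
--missing-indv --out global.chrXXI.snp.GQfiltered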
Here is a similar procedure applied to invariant sites. The invariant VCF file has an RGQ field indicating genotype quality rather than a GQ field. We are lenient and use a minimum read depth (DP) of 1 to call genotypes. Genotypes based on DP < 1 are converted to missing.
vcftools --gzvcf global.chrXXI.invar.vcf.gz \
--minDP 1 --recode --recode-INFO-all \
--out global.chrXXI.invar.filtered
mv global.chrXXI.invar.filtered.recode.vcf global.chrXXI.invar.DPfiltered.vcf
bgzip global.chrXXI.invar.DPfiltered.vcf
tabix -p vcf global.chrXXI.invar.DPfiltered.vcf.gz
Many analysts keep only biallelic SNPs (and indels). This step can be accomplished by inserting the extra line --restrict-alleles-to BIALLELIC \ into one of the above SelectVariants commands, perhaps right after --remove-unused-alternates. Alternatively, this can be done later in R, as sketched below.
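Here is a minimal sketch of the R approach (my example, not the only way): after reading the VCF file with readVcf as shown earlier, keep only sites with exactly one ALT allele.
# in R: keep only biallelic sites (exactly one ALT allele)
library(VariantAnnotation)
vcf <- readVcf(file = "global.chrXXI.snp.GQfiltered.vcf.gz",
genome = "stickleback_v5_assembly.fa.gz")
vcfBiallelic <- vcf[elementNROWS(alt(vcf)) == 1, ]
Note that a spanning deletion (*) counts as an ALT allele here, so a site whose only alternate allele is a spanning deletion would also be kept.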
After processing VCF files separately by chromosome, it may be necessary to merge them into a single large VCF file. Use GATK's MergeVcfs instead if the chromosomes aren't listed in the same order as in the headers of each of the individual files. Use SortVcf instead if the individual VCF files are not sorted by coordinates.
GatherVcfs
You can use GatherVcfs in GATK to merge VCF files. The chromosomes must be listed in the same order as in the headers of each of the individual files and cannot have overlapping positions. The samples must be the same in each of the files.
gatk --java-options "-Xmx4G" GatherVcfs \
-I subset.chrI.snp.filtered.vcf.gz \
-I subset.chrII.snp.filtered.vcf.gz \
-I subset.chrIII.snp.filtered.vcf.gz \
-I subset.chrIV.snp.filtered.vcf.gz \
-I subset.chrIX.snp.filtered.vcf.gz \
-I subset.chrV.snp.filtered.vcf.gz \
-I subset.chrVI.snp.filtered.vcf.gz \
-I subset.chrVII.snp.filtered.vcf.gz \
-I subset.chrVIII.snp.filtered.vcf.gz \
-I subset.chrX.snp.filtered.vcf.gz \
-I subset.chrXI.snp.filtered.vcf.gz \
-I subset.chrXII.snp.filtered.vcf.gz \
-I subset.chrXIII.snp.filtered.vcf.gz \
-I subset.chrXIV.snp.filtered.vcf.gz \
-I subset.chrXIX.snp.filtered.vcf.gz \
-I subset.chrXV.snp.filtered.vcf.gz \
-I subset.chrXVI.snp.filtered.vcf.gz \
-I subset.chrXVII.snp.filtered.vcf.gz \
-I subset.chrXVIII.snp.filtered.vcf.gz \
-I subset.chrXX.snp.filtered.vcf.gz \
-I subset.chrXXI.snp.filtered.vcf.gz \
-O subset.merged.snp.filtered.vcf.gz
gatk IndexFeatureFile -I subset.merged.snp.filtered.vcf.gz
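Typing 21 -I arguments is error-prone; a shell loop can build the argument list from the files present (a sketch of my own). The shell glob sorts names alphabetically, which happens to match the chromosome order in the command above; make sure the resulting order matches the contig order in the VCF headers.
# Build the -I argument list from the per-chromosome files
INPUTS=""
for f in subset.chr*.snp.filtered.vcf.gz; do
INPUTS="$INPUTS -I $f"
done
# $INPUTS is deliberately unquoted so it expands to separate arguments
gatk --java-options "-Xmx4G" GatherVcfs $INPUTS \
-O subset.merged.snp.filtered.vcf.gz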
Log in to a computer system using ssh in a terminal window. From off campus you might need to run myVPN and endure multi-factor authentication (not yet needed on Compute Canada).
For example, to log in to Graham at Compute Canada:
ssh -X -Y YOUR_NAME@graham.computecanada.ca
Or to log in to ARC Sockeye:
ssh -Y -X YOUR_NAME@sockeye.arc.ubc.ca # use cwl password
Compute Canada and ARC Sockeye keep track of how efficiently you use memory and CPU resources. Wasting resources reduces your priority rating (future jobs will sit longer in the queue). So it is a good idea to request the minimum amount that you need and, in the case of interactive mode, to exit as soon as you are done so resources don't sit idle. See the section below on monitoring jobs to figure out what resources your successful jobs used.
Compute Canada has a lot of software pre-installed, including GATK, R, and most Bioconductor packages.
ARC Sockeye doesn’t have a lot of software installed, but you can load the Compute Canada software module and then have access to all the same software, as follows:
# Sockeye only
module load CVMFS_CC
The ZCU Cluster doesn’t have an up-to-date software list.
If using Compute Canada, find the most recent installed version of the program listed here and load the corresponding module.
For example, to run a GATK command (the example here creates a sequence dictionary of the stickleback reference genome), load the gatk module and then run the command:
module load gatk/4.2.4.0
gatk --java-options "-Xmx2g" CreateSequenceDictionary \
-R stickleback_v5_assembly.fa
You can run Compute Canada software on ARC Sockeye by loading this module:
# Sockeye only
module load CVMFS_CC
module load gatk/4.2.4.0
gatk --java-options "-Xmx2g" CreateSequenceDictionary \
-R stickleback_v5_assembly.fa
If you are using a Mac, install XQuartz so that R can display plots on the screen of your local computer.
To run R on Compute Canada, load the module first.
module load gcc/9.3.0 r/4.1.0 r-bundle-bioconductor/3.14
R
If using the Compute Canada software module on ARC Sockeye, type
# on ARC Sockeye only
module load CVMFS_CC
module load gcc/9.3.0 r/4.1.0 r-bundle-bioconductor/3.14
R
On the ZCU Cluster, to run R and access Bioconductor packages type
# On the ZCU Cluster only
/Linux/R-4.2.0/bin/R
Type library() on the R command line to check which R packages are already available. Additional packages can be installed in your local folder. For example:
# in R
install.packages("hierfstat")
Most work on Compute Canada and ARC Sockeye will be done in a scratch folder, which has lots of disk space but is not backed up. Inactive files eventually expire.
It can be convenient, the first time you use the system, to create a link to your scratch folder by typing the following into a command window.
ln -s /scratch/YOUR_NAME scratch # make a symbolic link
cd scratch # go to your scratch folder
For tasks on Compute Canada or ARC Sockeye requiring significant time and memory resources, try running your code first in interactive mode. This will help you decide how much resources you’ll need if you run similar jobs in batch mode.
On Compute Canada, start an interactive session with the following unix command, which requests 32G of memory on a single core for 1 hour. You may need to wait a while for the interactive job to begin. If your work is done before the requested time is up, it is important to enter exit to end the interactive session so that resources don't sit idle until your full time runs out.
salloc --time=1:0:0 --ntasks-per-node=1 --mem-per-cpu=32G --account=def-schluter
[type your commands]
exit
Quick reference for a few useful tasks.
Lots of files are generated when processing genome sequence data, which starts to eat up disk space. Remove files created along the way that you no longer need.
To check disk space usage on a Compute Canada computer on all file systems, type the following command on your unix command line.
diskusage_report
Use the following command instead on ARC Sockeye.
print_quota
Especially on Compute Canada machines, your priority (affecting how quickly your queued job is run) declines as you use (or waste) resources. It takes a week or so to rebuild. One strategy when this happens is to switch to a different computer system within Compute Canada. If you have been working on Graham and your jobs are bogged down in the queue, move files to Beluga and try again. Priority is computer-specific.
Here is how to check my group's priority when you are logged in to a Compute Canada computer system:
sshare -l -A def-schluter_cpu -a --format=account,User,EffectvUsage,LevelFS
Compute Canada uses SLURM to schedule batch jobs (so does ARC Sockeye, as of November 2023).
Follow the instructions here on how to create a job script. To submit the job, enter
# in unix
sbatch YOUR_SCRIPT.sh
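Here is a minimal example of a job script (my sketch; adjust the time, memory, account name, module version, and command to your own job). Save it as YOUR_SCRIPT.sh and submit with sbatch as above.
#!/bin/bash
#SBATCH --time=3:00:00            # maximum run time
#SBATCH --ntasks-per-node=1       # single core
#SBATCH --mem-per-cpu=8G          # memory per core
#SBATCH --account=def-schluter    # your allocation account

# Load the modules your commands need
module load gatk/4.2.4.0

# The command(s) to run
gatk --java-options "-Xmx4g" SelectVariants \
-V global.chrXXI.vcf.gz \
--select-type-to-include SNP \
--exclude-filtered \
-O global.chrXXI.snp.vcf.gz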
Other useful commands include:
# Check the status of all jobs submitted
squeue -u YOUR_NAME
# Cancel a running job having job id 123456
scancel 123456
# Check resources used by successfully completed job 123456
seff 123456
An ExitCode of 0 means that no errors were generated. A low CPU Efficiency means that you requested more CPU time than you needed. A low Memory Efficiency means that you requested much more memory than you needed.
To check resources used by all jobs submitted between two dates:
sacct --user=YOUR_NAME --starttime=yyyy-mm-dd --endtime=yyyy-mm-dd \
--format=JobID,JobName,AllocCPUS,ExitCode,MaxRSS,Elapsed,UserCPU
© 2009-2024 Dolph Schluter