The next step in the pipeline after the alignments are conducted and adjusted is to call the SNPs and genotypes. There are many programs that will do this such as the ML script that many people in the Rieseberg lab use, freeBayes, mpileup and bcftools, as well as the GATK UnifiedGenotyper. To bench mark the alignment steps in these SNP calling blog posts I used mpileup and bcftools. For example:
samtools mpileup -u -D -f <your reference> *.bam | bcftools view -bvcg – > var.raw.bcf
bcftools view var.raw.bcf > var.flt.vcf
When genotyping 300 individuals, mpileup was taking a very very long time. As there is no option to use multiple processors you can run mpileup on each contig separately and then combine the vcf files at the end. That way you can run several at once depending on the number of processors you have using parallel and the short shell script below.
cat <list of your contig names> | parallel –j<processor number> mpileup.sh
mpileup.sh:
samtools mpileup -u -D –r $1 -f <your reference> *.bam | bcftools view -bvcg – > var.raw.bcf
The UnifiedGenotyper does have an option for using multiple processors:
java -jar <GATK_PATH>/GenomeAnalysisTK.jar -R <REF> -T UnifiedGenotyper -I <your file1>.bam -I <your file2>.bam (keep adding all your sorted and indexed bam files in this way) -o <out>.vcf -ntc 10
In a comparison of the UnifiedGenotyper and mpileup, I found more SNPs were called with mpileup and 289,236 of the filtered SNPs overlapped:
All SNP | Filtered SNPs | |
UnifiedGenotyper (bwa mem, mark duplicates, indel realignment, BQ off (default)) | 2,983,495 | 331,320 |
mpileup/bcftools (bwa mem, mark duplicates, indel realignment, BQ off (-BQ0)) | 2,987,196 | 394,724 |