When you get your GBS data from the sequencer, all your samples will be together in one file. The first step is to pull out the reads for individual samples using your barcodes. There are many ways of doing this, but here is one.
This script demultiplexes two-enzyme PstI-MspI GBS data with dual barcodes. It also works if you have a mix of barcoded and unbarcoded common adaptors. It tolerates single-base-pair errors in the barcode sequences, which seems to recover about 2% more reads. It trims off the barcode sequence it finds, along with the enzyme cut site. The cut site is real sequence but can’t be variable (otherwise the fragment wouldn’t have been cut and sequenced), so it isn’t really informative for pop gen studies.
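To make the matching and trimming steps concrete, here is a minimal Python sketch of the idea. It is not the script from this post. The function names, the example barcodes, and the tie-breaking rule (prefer an exact match, otherwise accept a one-mismatch match only if it is unique) are all my assumptions for illustration. I also assume the PstI remnant at the start of the read is TGCAG, so it is left as a parameter you can change.

```python
def hamming(a, b):
    """Count mismatched positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def match_barcode(read, barcodes, max_mismatches=1):
    """Return the barcode found at the start of `read`, or None.

    Assumed rule: an exact match wins (longest first); otherwise a
    near match is accepted only when exactly one barcode qualifies.
    """
    exact = [bc for bc in barcodes if read.startswith(bc)]
    if exact:
        return max(exact, key=len)
    near = [
        bc for bc in barcodes
        if len(read) >= len(bc)
        and 0 < hamming(read[:len(bc)], bc) <= max_mismatches
    ]
    return near[0] if len(near) == 1 else None


def trim_read(read, barcodes, cut_site="TGCAG", max_mismatches=1):
    """Strip the barcode and, if present, the cut-site remnant.

    Returns (barcode, trimmed_sequence), or (None, None) when no
    barcode can be assigned. `cut_site` is an assumed default.
    """
    bc = match_barcode(read, barcodes, max_mismatches)
    if bc is None:
        return None, None
    rest = read[len(bc):]
    if rest.startswith(cut_site):
        rest = rest[len(cut_site):]
    return bc, rest


# Hypothetical barcodes, for illustration only.
barcodes = ["ACGT", "TTAGC"]
exact_hit = trim_read("ACGTTGCAGAAAC", barcodes)
one_error = trim_read("ACCTTGCAGAAAC", barcodes)
no_match = trim_read("GGGGTGCAGAAAC", barcodes)
```

With real FASTQ data you would trim the quality string by the same number of bases, and run the same logic on the second read against the second barcode.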
For this you need a barcode sequence file: a tab-delimited text file with three columns (sample_name, barcode_1, barcode_2).
If you can’t remember what your barcodes are, use my other script, highlighted in another post.
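As a rough illustration of that file layout, here is a small Python sketch that reads a three-column table into a lookup keyed by the barcode pair. The function name, sample names, and barcodes are made up, and I am assuming the file has no header line and that each barcode pair appears only once; check your own file before relying on either.

```python
import csv
import io


def read_barcode_table(handle):
    """Map (barcode_1, barcode_2) -> sample_name from a tab-delimited file.

    Assumes three columns per line and no header line.
    """
    table = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue  # skip blank lines
        if len(row) != 3:
            raise ValueError(f"expected 3 columns, got {len(row)}: {row}")
        sample, bc1, bc2 = (field.strip() for field in row)
        key = (bc1.upper(), bc2.upper())
        if key in table:
            raise ValueError(f"duplicate barcode pair: {key}")
        table[key] = sample
    return table


# Hypothetical file contents, for illustration only.
example = "sampleA\tACGT\tTTAGC\nsampleB\tGGTCA\tCATG\n"
table = read_barcode_table(io.StringIO(example))
```

For a real file, pass an open file handle instead of the `io.StringIO` object, for example `open("barcodes.txt", newline="")`, where the filename is whatever you called yours.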
Hi Greg, Thanks for the scripts!
About the trimming, I have one question: although the cut site can’t be variable, will keeping these real sequences in the reads help increase alignment accuracy before SNP calling? (I don’t know how much difference it could make if reads are only several nucleotides longer.)
I agree that the cut sites can help increase alignment accuracy; it just gets complicated cutting them out at a later step. You’d want to align the reads and then trim the cut sites out of the BAM file. This is doable, but not something I’ve done. If you want a version of this script that doesn’t remove the cut sites, I can give you that; it’s a simple switch.