The GSC supplies our raw sequence reads as BAM files. Some programs will take unaligned BAM files as input (bwa is one), but many still do not. A much more flexible format is FASTQ. Here is a link to bam2fastq, a simple little conversion program:
http://www.hudsonalpha.org/gsl/software/bam2fastq.php
Continuing my transcriptome assembly narrative, I would log on to the cluster:
ssh -t redbeard3@zoology.ubc.ca -2 cluster
move to the directory that I put the raw data in:
cd anomalus_assembly/
and try to convert the file:
bam2fastq 70CG7AAXX_2_ACCCAG.bam
Only to find that the program isn’t installed! It’s OK, we can just make a little local installation.
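A quick check up front would have told me this before I tried the conversion. Here is a minimal sketch; the function name have_tool is made up for illustration, and it assumes project software lives in a local bin/ directory as in the rest of this post:

```shell
# Succeed if the named program is on PATH or exists as an executable in ./bin.
# (have_tool is a hypothetical helper name, not part of any package.)
have_tool() {
    command -v "$1" >/dev/null 2>&1 || [ -x "bin/$1" ]
}

if have_tool bam2fastq; then
    echo "bam2fastq is available"
else
    echo "bam2fastq not found; a local build is needed"
fi
```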
First, let’s make a ‘bin’ directory, where we can put all of the project’s software:
mkdir bin
we’ll need to fetch the source code:
cd bin
wget http://www.hudsonalpha.org/gsl/software/bam2fastq-1.1.0.tgz
extract it:
tar -xzvf bam2fastq-1.1.0.tgz
compile it:
cd bam2fastq-1.1.0
make
and clean up the mess:
mv bam2fastq ../
cd ..
rm -r bam2fastq-1.1.0*
Now we can get back to that conversion:
cd ..
bin/bam2fastq -o Ano1495#.fq --no-filtered 70CG7AAXX_2_ACCCAG.bam
Oh man, this is taking a while. I should have used ‘screen’….
Alright, it’s done! Now that I have some handy FASTQ files, I’ll remove that BAM file to save disk space:
ls
prints:
70CG7AAXX_2_ACCCAG.bam Ano1495_1.fq Ano1495_2.fq bin
rm 70CG7AAXX_2_ACCCAG.bam
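Since that rm deletes the only copy of the raw data in this directory, a sanity check along these lines could go before it. This is only a sketch: it assumes the usual four lines per FASTQ record and the file names used above, and count_reads is a made-up helper name.

```shell
# Count records in a FASTQ file, assuming 4 lines per read.
# (count_reads is a hypothetical helper, not a bam2fastq feature.)
count_reads() {
    echo $(( $(wc -l < "$1") / 4 ))
}

# Both mate files should be non-empty and hold the same number of reads.
if [ -s Ano1495_1.fq ] && [ -s Ano1495_2.fq ] &&
   [ "$(count_reads Ano1495_1.fq)" -eq "$(count_reads Ano1495_2.fq)" ]; then
    echo "mate files look consistent; OK to remove the BAM file"
else
    echo "check the FASTQ files before deleting anything" >&2
fi
```

Equal read counts do not prove the conversion was complete, but a mismatch or an empty file is a cheap warning sign.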
Great post, Chris – very helpful. A small note (I know you know this, but many others don’t) – for unix processes that unexpectedly take a long time, if you have run the process in the foreground (without an & at the end of the command) simply type ctrl-z (ctrl and z at the same time), which pauses the process, then type
bg
which restarts the process in the background. I believe the process will generally continue even if you log out of sciborg. It has for me, so they must have things set up not to terminate processes on logout. However, on other machines, or to make sure that it is not terminated when you log out, you can do:
disown -h [process id]
or
disown -r -h
for all running jobs.
Thanks Chris, this worked very smoothly for me! One note: the documentation for bam2fastq (http://www.hudsonalpha.org/gsl/software/bam2fastq.php) is a bit confusing about what happens to the QC-failed reads.
“--filtered
--no-filtered
Reads that are marked as failing QC checks will (will not) be extracted. [Default: extract filtered reads]”
I originally read this as meaning that the “--no-filtered” option would extract all reads (only a problem if the quality information is then discarded), but in fact this option means that only the reads that passed QC checks are included in the new FASTQ files.
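For anyone wondering what “marked as failing QC” means in practice: in the SAM/BAM format, each read carries an integer FLAG field, and bit 0x200 (decimal 512) means the read did not pass platform/vendor quality checks. The sketch below assumes that this is the bit bam2fastq looks at, which the documentation quoted above does not state explicitly; fails_qc is a made-up helper name.

```shell
# Succeed if a SAM FLAG value has the 0x200 (QC-fail) bit set.
# (fails_qc is a hypothetical helper for illustration only.)
fails_qc() {
    [ $(( $1 & 512 )) -ne 0 ]
}

# 77  = paired, read and mate unmapped, first in pair
# 589 = the same flags plus the QC-fail bit (77 + 512)
for flag in 77 589; do
    if fails_qc "$flag"; then
        echo "$flag: failed QC (dropped by --no-filtered)"
    else
        echo "$flag: passed QC (kept)"
    fi
done
```

If samtools is installed, samtools view -c -f 512 followed by the BAM file name should report how many reads carry that bit, which gives a rough idea of how many reads --no-filtered will leave out.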