GNU Parallel makes it easy to take advantage of multiple processors at once, and I really enjoy using it to parallelize Bash scripts. For example, here's a pipeline that generates a reference-based transcriptome from Illumina-sequenced ESTs:
#!/bin/bash
# ESTpipeline.sh
prefix=$1
# align each mate, trimming low quality bases
bwa aln -t 2 -I -q 20 TranscriptomeRef_mar30.fa ${prefix}_1.fq >${prefix}_1.sai
bwa aln -t 2 -I -q 20 TranscriptomeRef_mar30.fa ${prefix}_2.fq >${prefix}_2.sai
bwa sampe TranscriptomeRef_mar30.fa ${prefix}_1.sai ${prefix}_2.sai ${prefix}_1.fq ${prefix}_2.fq >$prefix.sam
# sam to bam conversion
samtools view -u -t TranscriptomeRef_mar30.fa.fai -S $prefix.sam -o $prefix.bam
samtools sort $prefix.bam $prefix.sort
samtools index $prefix.sort.bam
# realign around indels
java -Xmx1g -jar GenomeAnalysisTK.jar -T RealignerTargetCreator -R TranscriptomeRef_mar30.fa -o $prefix.realign.intervals -I $prefix.sort.bam >$prefix.realign1.log
java -Xmx1g -jar GenomeAnalysisTK.jar -T IndelRealigner -R TranscriptomeRef_mar30.fa -targetIntervals $prefix.realign.intervals -I $prefix.sort.bam -o $prefix.realign.bam >$prefix.realign2.log
java -jar AddOrReplaceReadGroups.jar I=$prefix.realign.bam O=$prefix.rehead.bam SORT_ORDER=coordinate RGID=$prefix RGLB=$prefix RGPL=illumina RGSM=4BASE_NAME RGPU=$prefix CREATE_INDEX=True VALIDATION_STRINGENCY=SILENT
java -jar GenomeAnalysisTK.jar -l INFO -R TranscriptomeRef_mar30.fa -T UnifiedGenotyper -I $prefix.rehead.bam -o $prefix.vcf --output_mode EMIT_ALL_SITES >$prefix.genotype.log
# vcf to diploid vertical alignment conversion
echo $prefix >$prefix.Valign
cat $prefix.vcf | ./vcf2vertical.pl >>$prefix.Valign
# remove intermediate files
rm ${prefix}_1.sai ${prefix}_2.sai $prefix.sam $prefix.bam $prefix.sort.bam $prefix.rehead.bam $prefix.realign.intervals $prefix.vcf
Notice the prefix=$1 line near the top. The script takes a single parameter as input (a file name prefix) and reuses it throughout. Now, say we have a directory of Illumina ESTs represented as FastQ files:
[clhead] illumina $ ls *.fq
Ames449_1.fq ARG1834_1.fq btm17-4_1.fq btm25-2_1.fq btm34-6_1.fq
Ames449_2.fq ARG1834_2.fq btm17-4_2.fq btm25-2_2.fq btm34-6_2.fq
Ames695_1.fq arg2B-4_1.fq btm19-1_1.fq btm26-4_1.fq btm5-1_1.fq
Ames695_2.fq arg2B-4_2.fq btm19-1_2.fq btm26-4_2.fq btm5-1_2.fq
arg11B-11_1.fq arg4B-8_1.fq btm19-2_1.fq btm27-3_1.fq btm7B-14_1.fq
arg11B-11_2.fq arg4B-8_2.fq btm19-2_2.fq btm27-3_2.fq btm7B-14_2.fq
arg14B-7_1.fq arg6B-1_1.fq btm20-8_1.fq btm30-6_1.fq btm9-4_1.fq
arg14B-7_2.fq arg6B-1_2.fq btm20-8_2.fq btm30-6_2.fq btm9-4_2.fq
ARG1805_1.fq btm10-5_1.fq btm21-4_1.fq btm31-6_1.fq
ARG1805_2.fq btm10-5_2.fq btm21-4_2.fq btm31-6_2.fq
ARG1820_1.fq btm13-4_1.fq btm22-8_1.fq btm32-3_1.fq
ARG1820_2.fq btm13-4_2.fq btm22-8_2.fq btm32-3_2.fq
If we make a list of the file name prefixes, we can easily process them all in parallel, taking advantage of all the cores available on the computer:
[crunch09] illumina $ ls *_1.fq | sed -e "s/_1.fq//" >prefix.list ; cat prefix.list
Ames449
Ames695
arg11B-11
arg14B-7
ARG1805
ARG1820
ARG1834
arg2B-4
arg4B-8
arg6B-1
btm10-5
btm13-4
btm17-4
btm19-1
btm19-2
btm20-8
btm21-4
btm22-8
btm25-2
btm26-4
btm27-3
btm30-6
btm31-6
btm32-3
btm34-6
btm5-1
btm7B-14
btm9-4
How many jobs should we launch? We can count the computer’s processors like this:
[crunch09] ~ $ cat /proc/cpuinfo | grep -c processor
16
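If GNU coreutils is installed (as it is on most Linux systems), nproc reports the same count with less typing; anchoring the grep pattern also avoids accidental matches elsewhere in /proc/cpuinfo:

```shell
# nproc (GNU coreutils) prints the number of processing units available.
nproc

# Equivalent on Linux, reading /proc/cpuinfo directly; the ^ anchor makes
# sure we only count lines that begin with "processor".
grep -c '^processor' /proc/cpuinfo
```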
There are 16 processors on this node. We can check their load using top:
[crunch09] ~ $ top
top - 18:28:37 up 160 days, 12:01, 1 user, load average: 0.00, 0.00, 0.00
Tasks: 276 total, 1 running, 274 sleeping, 0 stopped, 1 zombie
Cpu(s): 0.0%us, 0.0%sy, 0.0%ni, 99.9%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 49433544k total, 34412240k used, 15021304k free, 178668k buffers
Swap: 4194296k total, 0k used, 4194296k free, 33565872k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
22493 fitzjohn 16 0 30960 2448 1568 S 0.7 0.0 30:24.87 top
17259 grassa 15 0 30868 2308 1560 R 0.3 0.0 0:00.08 top
1 root 15 0 10348 704 592 S 0.0 0.0 0:02.21 init
2 root RT -5 0 0 0 S 0.0 0.0 0:03.35 migration/0
3 root 34 19 0 0 0 S 0.0 0.0 0:00.07 ksoftirqd/0
4 root RT -5 0 0 0 S 0.0 0.0 0:00.00 watchdog/0
5 root RT -5 0 0 0 S 0.0 0.0 0:00.78 migration/1
6 root 34 19 0 0 0 S 0.0 0.0 0:00.08 ksoftirqd/1
7 root RT -5 0 0 0 S 0.0 0.0 0:00.00 watchdog/1
8 root RT -5 0 0 0 S 0.0 0.0 0:00.60 migration/2
9 root 34 19 0 0 0 S 0.0 0.0 0:00.11 ksoftirqd/2
They’re all idle! Those processors should be generating information. We can now use GNU Parallel to run the Bash script on fourteen processors at a time, leaving two free:
cat prefix.list | parallel -j 14 ./ESTpipeline.sh
In this example, GNU Parallel feeds each line of standard input to the Bash script as its single parameter, running up to fourteen jobs at once.
Awesome tool! So much better than my old standby of Perl scripts with exec calls.
One note of caution to those new to parallel processing: if your programs access the same resources concurrently, you might end up with unexpected results. This is called a race condition. Examples include running multiple programs that write to the same file, or running programs that depend on each other’s output. Unless you take extra steps to ensure that the processes execute in the order you want, you might end up with bad output. If your programs don’t update shared resources or depend on each other, then definitely use parallel processing to speed things up.
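To make the hazard concrete, here is a small toy sketch (not part of the pipeline above) contrasting two background jobs appending to one shared file with each job writing its own file and merging afterwards:

```shell
tmp=$(mktemp -d)

# Risky: both jobs append to the same file, so their lines end up
# interleaved in an order that can differ from run to run.
for i in 1 2; do
    ( for j in $(seq 50); do echo "job$i line$j" >>"$tmp/shared.txt"; done ) &
done
wait

# Safe: each job owns its output file; merge afterwards in a fixed order.
for i in 1 2; do
    ( for j in $(seq 50); do echo "job$i line$j"; done >"$tmp/job$i.txt" ) &
done
wait
cat "$tmp/job1.txt" "$tmp/job2.txt" >"$tmp/merged.txt"

shared_lines=$(wc -l <"$tmp/shared.txt")
merged_lines=$(wc -l <"$tmp/merged.txt")
echo "shared: $shared_lines lines, merged: $merged_lines lines"
rm -r "$tmp"
```

Both files contain all 100 lines, but only the merged one has them in a deterministic order.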
This is a very important point! It’s the reason for wrapping the pipeline in a self-contained Bash script, with each invocation running on its own core and writing only to its own prefix-named files. One cool thing to note about Parallel is the -k flag, which keeps the output order the same as the input order for simple commands that read STDIN and write to STDOUT.
The man page includes a nice collection of examples:
http://www.gnu.org/software/parallel/man.html
cool. thanks guys!