Methods Write-up for Identifying Contamination in Sequencing Reads

Chris and I devised and implemented this method to identify contaminant DNA in our HA412 454 and Illumina WGS reads.

N.B. I will update this post with links to the scripts once I get them up on the lab’s Bitbucket account, plus extra details if I think they are necessary on review.

  1. BLAST Assemblies Against NCBI NT Database
    We identified the unique taxa among the BLAST hits and fetched their lineages from NCBI using heyaudy.com (bitbucket: wgetLineage.pl). We kept the hits that matched non-plant taxa (bitbucket: blast.sh).
  2. Assessment of Contaminant Hits
    We identified contaminant hits with >95% identity and used these to pin down the several organisms/genera from which all of the contamination came.
  3. Genome to Genome BLAST
    We then BLASTed the relevant genomes, plus the UniVec database (a database of artificial sequences), against the assemblies of interest (bitbucket: blastwithgenomes.sh, blast.univec.sh, scaffoldstats.pl) and extracted hits based on the following criteria:
    “Blacklist” sequences had >=98% identity and were >100 bp long (length was not used to filter UniVec sequences), with no plant hits overlapping by >20 bp.
    “Greylist” sequences needed >85% identity and met the same length criterion, but were allowed overlapping plant hits as long as the bit score of the non-plant hit was greater than the bit score of the plant hit.
  4. Contaminant Hit Sequence Extraction
    We extracted the sequences that fit the criteria for likely contamination from each round of BLAST (bitbucket: extract_bad_hit_sequence.univec.pl, extract_bad_hit_sequence.genomeblast.pl, extract_bad_hit_sequence.NTblast.pl), pruned them for duplicates, and split them according to the criteria in section 3 to create the final contaminant Greylist and Blacklist.
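
To make the section 3 criteria concrete, here is a rough Python sketch of how a single non-plant hit could be sorted into the Blacklist or Greylist. This is not the code from our scripts (those are Perl and shell); the `Hit` record, the `overlap` and `classify` functions, and the field names are all made up for illustration, loosely following the columns of BLAST tabular output. It also assumes that overlap is measured on the assembly sequence coordinates and that the 20 bp overlap threshold applies to the Greylist check as well, neither of which is spelled out above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Hit:
    """One BLAST hit on an assembly sequence (hypothetical record layout)."""
    query: str       # assembly scaffold/contig ID
    pident: float    # percent identity
    length: int      # alignment length in bp
    qstart: int      # start coordinate on the assembly sequence
    qend: int        # end coordinate on the assembly sequence
    bitscore: float


def overlap(a: Hit, b: Hit) -> int:
    """Number of bases shared by two hits on the same assembly sequence."""
    if a.query != b.query:
        return 0
    a_lo, a_hi = sorted((a.qstart, a.qend))
    b_lo, b_hi = sorted((b.qstart, b.qend))
    return max(0, min(a_hi, b_hi) - max(a_lo, b_lo) + 1)


def classify(hit: Hit, plant_hits: List[Hit], is_univec: bool = False) -> Optional[str]:
    """Return "black", "grey", or None for a non-plant hit."""
    # Length filter: > 100 bp, skipped for UniVec hits.
    if not is_univec and hit.length <= 100:
        return None
    # Plant hits that overlap this hit by more than 20 bp.
    overlapping = [p for p in plant_hits if overlap(hit, p) > 20]
    # Blacklist: >= 98% identity and no overlapping plant hit.
    if hit.pident >= 98 and not overlapping:
        return "black"
    # Greylist: > 85% identity, and the hit outscores every overlapping plant hit.
    if hit.pident > 85 and all(hit.bitscore > p.bitscore for p in overlapping):
        return "grey"
    return None


contaminant = Hit("scaffold_1", 99.0, 500, 1, 500, 900.0)
weak_plant = Hit("scaffold_1", 90.0, 100, 450, 549, 150.0)
print(classify(contaminant, []))            # no plant hits at all
print(classify(contaminant, [weak_plant]))  # overlapping plant hit with a lower bit score
```

In this sketch a hit that meets the Blacklist identity threshold but overlaps a weaker plant hit falls through to the Greylist, which is one reasonable reading of the criteria rather than a statement of exactly how our scripts handle that case.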