Microbiology, and especially microbial ecology, has become increasingly dependent on advanced DNA and RNA sequencing technologies. This is most evident in the growing popularity of human microbiome research and its links to human health. While DNA sequencing can appear relatively simple (a result of the great effort put into simplifying the user experience), it is still a complicated technique that requires a lot of thought and skill. One aspect that genomic scientists (whether focusing on human or microbial DNA) must always consider is the bias introduced by the sequencing platform itself. This week I want to focus on a recently published manuscript that describes the sequencing error profiles associated with some of the most popular Illumina platforms.
We know that sequencing platforms introduce systematic biases. Last year a group showed this to be true when performing 16S rRNA amplicon sequencing on Illumina platforms [1]. This year Schirmer et al. (from the same lab) expanded on that work by characterizing the errors associated with metagenomic sequencing techniques (i.e. random shotgun sequencing) [2].
The paper aims to address four points:
- Define error rates of substitutions and indels between platforms.
- Identify sequence motifs associated with errors.
- Evaluate the ability of quality scores to predict different error types.
- Compare error removal approaches across platforms.
The corresponding findings were:
- Substitutions are more frequent than indels and their frequency varies by platform.
- Errors are associated with trimer motifs that are consistent across sequencing platforms.
- Base errors are associated with low quality scores.
- Quality trimming and Bayes Hammer are most effective for reducing errors when used together.
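To make the quality score and quality trimming points above more concrete, here is a minimal Python sketch. It is not the pipeline used in the paper. It only illustrates the standard Phred relationship (an error probability of 10^(-Q/10) for a score Q) and a very simple 3' end trimming rule. The function names, the Phred+33 encoding, and the `min_q=20` cutoff are my own assumptions for illustration.

```python
def phred_to_error_prob(q: int) -> float:
    """Convert a Phred quality score Q into an error probability, p = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def decode_phred33(quality_string: str) -> list[int]:
    """Decode a FASTQ quality string, assuming the Phred+33 ASCII encoding."""
    return [ord(char) - 33 for char in quality_string]


def trim_3prime(seq: str, quals: list[int], min_q: int = 20) -> tuple[str, list[int]]:
    """Drop bases from the 3' end until a base with quality >= min_q is reached.

    min_q=20 is an illustrative cutoff, not a value taken from the paper.
    """
    end = len(seq)
    while end > 0 and quals[end - 1] < min_q:
        end -= 1
    return seq[:end], quals[:end]


# Toy read whose last two bases have very low quality scores.
read = "ACGTACGTAC"
quals = decode_phred33("IIIIIII5#!")
trimmed_seq, trimmed_quals = trim_3prime(read, quals)
print(trimmed_seq, [phred_to_error_prob(q) for q in trimmed_quals])
```

Real quality trimmers typically use sliding windows or running sums rather than this base-by-base rule, and an error corrector such as BayesHammer works very differently, so treat this only as a picture of what "low quality scores flag likely errors" means in practice.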
There was one additional point that I thought was worth noting, since Schirmer et al. didn't really get into it in the paper. The group talks about nucleotide motifs associated with errors, and makes a note of error-associated adenine and thymine residues. This is interesting because adenines are added to the end of a sequence after the sequencer has read through the DNA fragment. Said another way, when a DNA fragment is shorter than the read length, the platform will read through the DNA and, once it falls off the end of the fragment, start inserting a string of A's as a placeholder. As far as I can tell, the research group did not perform the quality control step of trimming these A strings (and T strings for the reverse complement), meaning their analysis could be picking these up. If so, the A's could be throwing off their other analyses, such as motif identification and sequence alignment. Because there were other error-associated motifs, it seems unlikely that this point ruins the paper, but it is important to keep in mind when interpreting their results.
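For readers who have not seen this kind of quality control step, the sketch below shows one simple way to strip a trailing run of A's (and a leading run of T's, for reads reported as the reverse complement). It is a hypothetical illustration, not something taken from the paper or from any particular trimming tool. The function name and the `min_run=8` threshold are assumptions.

```python
import re


def trim_homopolymer_ends(seq: str, min_run: int = 8) -> str:
    """Remove a trailing poly-A run and a leading poly-T run.

    Only runs of at least min_run bases are removed; min_run=8 is an
    arbitrary illustrative threshold.
    """
    seq = re.sub(r"A{%d,}$" % min_run, "", seq)  # placeholder A's at the 3' end
    seq = re.sub(r"^T{%d,}" % min_run, "", seq)  # the same run seen as a reverse complement
    return seq


print(trim_homopolymer_ends("ACGTGC" + "A" * 10))  # long A tail is removed
print(trim_homopolymer_ends("ACGTAAA"))            # short A run is left alone
```

One caveat with a rule like this is that it also removes genuine A-rich or T-rich sequence at read ends, so the threshold matters and the trimmed reads should be spot checked.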
1. Schirmer, M., Ijaz, U., D'Amore, R., Hall, N., Sloan, W., & Quince, C. (2015). Insight into biases and sequencing errors for amplicon sequencing with the Illumina MiSeq platform. Nucleic Acids Research, 43(6). DOI: 10.1093/nar/gku1341
2. Schirmer, M., D'Amore, R., Ijaz, U., Hall, N., & Quince, C. (2016). Illumina error profiles: resolving fine-scale variation in metagenomic sequencing data. BMC Bioinformatics, 17(1). DOI: 10.1186/s12859-016-0976-y