Accuracy and quality of massively parallel DNA pyrosequencing

doi:10.1186/gb-2007-8-7-r143

Genome Biol. 2007;8(7):R143.

Susan M Huse¹, Julie A Huber, Hilary G Morrison, Mitchell L Sogin, David Mark Welch

PMID: 17659080
PMCID: PMC2323236
DOI: 10.1186/gb-2007-8-7-r143

Susan M Huse et al. Genome Biol. 2007.

Affiliation

¹ Josephine Bay Paul Center, Marine Biological Laboratory at Woods Hole, MBL Street, Woods Hole, MA 02543, USA.

Abstract

Background: Massively parallel pyrosequencing systems have increased the efficiency of DNA sequencing, although the published per-base accuracy of a Roche GS20 is only 96%. In genome projects, highly redundant consensus assemblies can compensate for sequencing errors. In contrast, studies of microbial diversity that catalogue differences between PCR amplicons of ribosomal RNA genes (rDNA) or other conserved gene families cannot take advantage of consensus assemblies to detect and minimize incorrect base calls.

Results: We performed an empirical study of the per-base error rate for the Roche GS20 system using sequences of the V6 hypervariable region from cloned microbial ribosomal DNA (tag sequencing). We calculated a 99.5% accuracy rate in unassembled sequences, and identified several factors that can be used to remove a small percentage of low-quality reads, improving the accuracy to 99.75% or better.
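
As a rough illustration of what a per-base error rate against known reference sequences involves, the sketch below counts substitutions, insertions and deletions with a plain edit distance and divides by the number of reference bases. The function names and the choice of denominator are assumptions made for this example; the abstract does not specify the alignment method or the exact definition the authors used.

```python
def edit_distance(read: str, ref: str) -> int:
    """Levenshtein distance: minimum number of substitutions, insertions and deletions."""
    prev = list(range(len(ref) + 1))
    for i, read_base in enumerate(read, start=1):
        curr = [i]
        for j, ref_base in enumerate(ref, start=1):
            cost = 0 if read_base == ref_base else 1
            curr.append(min(
                prev[j] + 1,         # extra base in the read (insertion)
                curr[j - 1] + 1,     # base missing from the read (deletion)
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def per_base_error_rate(reads_with_refs):
    """Total errors divided by total reference bases (an assumed definition)."""
    errors = sum(edit_distance(read, ref) for read, ref in reads_with_refs)
    bases = sum(len(ref) for _, ref in reads_with_refs)
    return errors / bases
```

Under this definition, an accuracy of 99.5% corresponds to an error rate of 0.005, that is, about one error per 200 reference bases.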

Conclusion: By using objective criteria to eliminate low-quality data, the quality of individual GS20 sequence reads in molecular ecological applications can surpass the accuracy of traditional capillary methods.

Figures

**Figure 1**
Low quality reads contribute disproportionately to the overall error rate. The graph shows the proportion of reads and test fragments at each percent difference from their reference sequence (individual error rate) and the proportion of errors contributed by reads and test fragments at each given difference (cumulative error rate). The vast majority of both experimental and test fragment reads contain few or no errors; only 5% of all reads and 0.6% of test fragments differ from their reference sequence by 2% or more. The experimental reads that have errors, however, are likely to have a large number of errors and thus be quite different from their reference sequence. For instance, 40% of all errors are from the 1% of reads differing by at least 10% from their reference sequence. The GS20 test fragments, by contrast, show far fewer very low-quality sequences: only 3% of the test fragment errors are from sequences at least 10% different from their reference.
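
The kind of summary quoted in this caption (a small fraction of reads carrying a large fraction of the errors) can be computed as in the sketch below. The input layout, a list of `(num_errors, reference_length)` pairs, and the function name `error_share` are hypothetical and chosen only for illustration.

```python
def error_share(reads, threshold):
    """Summarise how much of the total error comes from highly divergent reads.

    reads: list of (num_errors, reference_length) pairs, one per read.
    threshold: fractional difference from the reference, e.g. 0.10 for 10%.
    Returns (fraction of reads at or above the threshold,
             fraction of all errors carried by those reads).
    """
    total_errors = sum(errors for errors, _ in reads)
    flagged = [(errors, length) for errors, length in reads
               if errors / length >= threshold]
    read_fraction = len(flagged) / len(reads)
    if total_errors == 0:
        return read_fraction, 0.0
    error_fraction = sum(errors for errors, _ in flagged) / total_errors
    return read_fraction, error_fraction


# Toy data (not from the paper): one of ten reads differs by 12% from its reference.
toy_reads = [(0, 100)] * 8 + [(1, 100), (12, 100)]
read_fraction, error_fraction = error_share(toy_reads, 0.10)
```

In the toy data, 10% of the reads account for 12 of the 13 errors, which mirrors the disproportion the caption describes.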

**Figure 2**
Quality scores are of limited use in predicting accuracy of unknown sequences. The quality scores reported by the GS20 software correlate with decreased confidence in calling the correct homopolymer length rather than the accuracy of the called bases. **(a, b)** The average quality score of reads decreases as the number of errors in the read increases. **(c)** The average quality score as a function of position in the homopolymer: as the length of the homopolymer increases, the quality scores decrease, for both correctly and incorrectly called bases. **(d)** The average quality scores of perfect reads containing differing numbers of homopolymers. The average quality scores decrease with the number of homopolymers. Our sequences contain only short homopolymers, primarily 3-mers. As the length and frequency of homopolymers increases, the expected quality scores will decrease. Without *a priori* knowledge of the number and length of homopolymers in a particular read, it will be difficult to assess an appropriate quality threshold - a low threshold may not cull data adequately and a high threshold may remove homopolymeric regions.
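
Because the caption ties quality scores to the number and length of homopolymers, it can help to locate the homopolymer runs in a read before interpreting its scores. The following is a minimal sketch; the minimum run length of 3 is an assumption based on the caption's mention of 3-mers, and `homopolymer_runs` is a hypothetical helper name.

```python
from itertools import groupby


def homopolymer_runs(seq, min_length=3):
    """Return (start, base, length) for each run of identical bases of at least min_length."""
    runs = []
    position = 0
    for base, group in groupby(seq):
        length = sum(1 for _ in group)
        if length >= min_length:
            runs.append((position, base, length))
        position += length
    return runs


runs = homopolymer_runs("ACGGGTTTTA")
```

Here `runs` holds one entry for the `GGG` run starting at position 2 and one for the `TTTT` run starting at position 5.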

**Figure 3**
Error rates increase as read length diverges from predicted. The graphs show the average difference from the reference sequence for all reads of a given length, and the distribution of read lengths for all reads. The majority of reads peak at a few specific lengths. The reads beyond the peaks shown are too few to appear on the graph; however, they contain many more errors than the reads of the majority length(s). Perfect reads peak at only a few specific lengths. Sequences that fall outside of these lengths are unlikely to be truncated sequences or to have sequenced beyond the end of the primer. Instead they tend to be low-quality reads of spurious sequences. **(a)** The average error rate of sequences at each length for 56,700 reads of reference sequence 517. **(b)** The average error rate of sequences at each length for all reads combined. Even with a mixture of sequence lengths, the reads outside of the peak lengths are highly error prone.
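
One simple way to act on this observation is to keep only reads whose length falls at one of the dominant peaks. The sketch below treats a length as a peak when it accounts for at least `min_fraction` of all reads; that cutoff and the function name are assumptions for illustration and are not necessarily the criterion the authors applied.

```python
from collections import Counter


def filter_by_peak_length(reads, min_fraction=0.05):
    """Keep reads whose length is shared by at least min_fraction of all reads.

    Returns (kept_reads, peak_lengths).
    """
    counts = Counter(len(read) for read in reads)
    peak_lengths = {length for length, n in counts.items()
                    if n / len(reads) >= min_fraction}
    kept = [read for read in reads if len(read) in peak_lengths]
    return kept, peak_lengths


# Toy data (not from the paper): two common lengths and one outlier.
toy_reads = ["A" * 60] * 10 + ["A" * 61] * 9 + ["A" * 45]
kept, peaks = filter_by_peak_length(toy_reads, min_fraction=0.1)
```

With the toy data, lengths 60 and 61 are retained as peaks and the single 45-base read is discarded.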

**Figure 4**
The presence of Ns correlates well with sequencing errors. The presence of an N in the sequence indicates the GS20's inability to accurately call a base at that position within the sequence. The number of other sequencing errors (substitutions, insertions and deletions) within a sequence read correlates with the number of uncalled bases. By removing all reads that contain one or more Ns, the overall sequencing error rate drops substantially.
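
The filter described here, removing every read that contains one or more Ns, is straightforward to express. This is a minimal sketch that assumes reads are plain strings; the function name is hypothetical, and the case-insensitive match is an added convenience rather than something stated in the caption.

```python
def drop_reads_with_n(reads):
    """Discard any read containing at least one uncalled base (N)."""
    return [read for read in reads if "N" not in read.upper()]


clean = drop_reads_with_n(["ACGT", "ACNT", "acgn", "GGCA"])
```

Only the two reads without an uncalled base remain in `clean`.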

Cited by

Deep Sequencing of the HIV-1 env Gene Reveals Discrete X4 Lineages and Linkage Disequilibrium between X4 and R5 Viruses in the V1/V2 and V3 Variable Regions.
Zhou S, Bednar MM, Sturdevant CB, Hauser BM, Swanstrom R. J Virol. 2016 Jul 27;90(16):7142-58. doi: 10.1128/JVI.00441-16. Print 2016 Aug 15. PMID: 27226378. Free PMC article.
Fast skeletal muscle transcriptome of the gilthead sea bream (Sparus aurata) determined by next generation sequencing.
Garcia de la Serrana D, Estévez A, Andree K, Johnston IA. BMC Genomics. 2012 May 11;13:181. doi: 10.1186/1471-2164-13-181. PMID: 22577894. Free PMC article.
Challenges with using primer IDs to improve accuracy of next generation sequencing.
Brodin J, Hedskog C, Heddini A, Benard E, Neher RA, Mild M, Albert J. PLoS One. 2015 Mar 5;10(3):e0119123. doi: 10.1371/journal.pone.0119123. eCollection 2015. PMID: 25741706. Free PMC article.
Metagenomic Insights into Effects of Chemical Pollutants on Microbial Community Composition and Function in Estuarine Sediments Receiving Polluted River Water.
Lu XM, Chen C, Zheng TL. Microb Ecol. 2017 May;73(4):791-800. doi: 10.1007/s00248-016-0868-8. Epub 2016 Oct 15. PMID: 27744476.
Phaeocystis antarctica blooms strongly influence bacterial community structures in the Amundsen Sea polynya.
Delmont TO, Hammar KM, Ducklow HW, Yager PL, Post AF. Front Microbiol. 2014 Dec 19;5:646. doi: 10.3389/fmicb.2014.00646. eCollection 2014. PMID: 25566197. Free PMC article.
