Genome Biol. 2007;8(7):R143.
doi: 10.1186/gb-2007-8-7-r143.

Accuracy and quality of massively parallel DNA pyrosequencing

Susan M Huse et al. Genome Biol. 2007.

Abstract

Background: Massively parallel pyrosequencing systems have increased the efficiency of DNA sequencing, although the published per-base accuracy of a Roche GS20 is only 96%. In genome projects, highly redundant consensus assemblies can compensate for sequencing errors. In contrast, studies of microbial diversity that catalogue differences between PCR amplicons of ribosomal RNA genes (rDNA) or other conserved gene families cannot take advantage of consensus assemblies to detect and minimize incorrect base calls.

Results: We performed an empirical study of the per-base error rate for the Roche GS20 system using sequences of the V6 hypervariable region from cloned microbial ribosomal DNA (tag sequencing). We calculated a 99.5% accuracy rate in unassembled sequences, and identified several factors that can be used to remove a small percentage of low-quality reads, improving the accuracy to 99.75% or better.

Conclusion: By using objective criteria to eliminate low-quality data, the quality of individual GS20 sequence reads in molecular ecological applications can surpass the accuracy of traditional capillary methods.

Figures

Figure 1
Low-quality reads contribute disproportionately to the overall error rate. The graph shows the proportion of reads and test fragments at each percent difference from their reference sequence (individual error rate) and the proportion of errors contributed by reads and test fragments at each given difference (cumulative error rate). The vast majority of both experimental and test fragment reads contain few or no errors; only 5% of all reads and 0.6% of test fragments differ from their reference sequence by 2% or more. The experimental reads that have errors, however, are likely to have a large number of errors and thus be quite different from their reference sequence. For instance, 40% of all errors are from the 1% of reads differing by at least 10% from their reference sequence. The GS20 test fragments, by contrast, show far fewer very low-quality sequences: only 3% of the test fragment errors are from sequences at least 10% different from their reference.
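The disproportionate contribution described in this caption can be illustrated with a small calculation. The following is a minimal sketch, not the authors' analysis code: the function name `error_share_above`, the `(n_errors, ref_length)` input layout, and the toy numbers are all assumptions made for illustration.

```python
def error_share_above(reads, threshold):
    """Return (fraction of reads, fraction of all errors) for reads whose
    per-read difference from the reference is at least `threshold`.

    `reads` is a list of (n_errors, ref_length) tuples; this layout is a
    hypothetical one chosen for the example.
    """
    total_errors = sum(errors for errors, _ in reads)
    # Reads at or above the divergence threshold (e.g. 0.10 for 10%).
    flagged = [(errors, length) for errors, length in reads
               if length > 0 and errors / length >= threshold]
    read_fraction = len(flagged) / len(reads) if reads else 0.0
    error_fraction = (sum(errors for errors, _ in flagged) / total_errors
                      if total_errors else 0.0)
    return read_fraction, error_fraction


# Toy data (invented): 98 perfect reads, one read with a single error,
# and one highly divergent read, all against a 100-base reference.
toy_reads = [(0, 100)] * 98 + [(1, 100), (12, 100)]
read_fraction, error_fraction = error_share_above(toy_reads, 0.10)
print(f"{read_fraction:.1%} of reads carry {error_fraction:.1%} of errors")
```

In this toy set a single divergent read accounts for most of the errors, which mirrors the pattern the figure reports for the experimental reads.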
Figure 2
Quality scores are of limited use in predicting accuracy of unknown sequences. The quality scores reported by the GS20 software correlate with decreased confidence in calling the correct homopolymer length rather than the accuracy of the called bases. (a, b) The average quality score of reads decreases as the number of errors in the read increases. (c) The average quality score as a function of position in the homopolymer: as the length of the homopolymer increases, the quality scores decrease, for both correctly and incorrectly called bases. (d) The average quality scores of perfect reads containing differing numbers of homopolymers. The average quality scores decrease with the number of homopolymers. Our sequences contain only short homopolymers, primarily 3-mers. As the length and frequency of homopolymers increase, the expected quality scores will decrease. Without a priori knowledge of the number and length of homopolymers in a particular read, it will be difficult to assess an appropriate quality threshold: a low threshold may not cull data adequately and a high threshold may remove homopolymeric regions.
Figure 3
Error rates increase as read length diverges from predicted. The graphs show the average difference from the reference sequence for all reads of a given length, and the distribution of read lengths for all reads. The majority of reads peak at a few specific lengths. The reads beyond the peaks shown are too few to appear on the graph; however, they contain many more errors than the reads of the majority length(s). Perfect reads peak at only a few specific lengths. Sequences that fall outside of these lengths are unlikely to be truncated sequences or to have sequenced beyond the end of the primer. Instead they tend to be low-quality reads of spurious sequences. (a) The average error rate of sequences at each length for 56,700 reads of reference sequence 517. (b) The average error rate of sequences at each length for all reads combined. Even with a mixture of sequence lengths, the reads outside of the peak lengths are highly error prone.
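One way to act on this observation is to keep only reads whose length falls at one of the dominant peaks. The sketch below is an assumption-laden illustration rather than the criterion used in the paper: the function name `filter_by_peak_length` and the `min_fraction` cutoff are hypothetical, and a real analysis would choose the expected lengths from the amplicon design.

```python
from collections import Counter


def filter_by_peak_length(reads, min_fraction=0.05):
    """Keep reads whose length is shared by at least `min_fraction` of all reads.

    `reads` is a list of sequence strings. The 5% default is an arbitrary
    illustrative cutoff, not a value taken from the paper.
    """
    length_counts = Counter(len(read) for read in reads)
    total = len(reads)
    # Lengths common enough to count as a "peak" in the length distribution.
    peak_lengths = {length for length, count in length_counts.items()
                    if count / total >= min_fraction}
    kept = [read for read in reads if len(read) in peak_lengths]
    return kept, peak_lengths


# Toy data (invented): nine reads at the dominant length, one outlier.
toy_reads = ["ACGT"] * 9 + ["ACGTACGTAC"]
kept, peaks = filter_by_peak_length(toy_reads, min_fraction=0.2)
print(len(kept), sorted(peaks))
```

A frequency-based cutoff like this only works when the correct lengths dominate the data set; with a highly mixed community, known reference lengths would be a safer basis for the filter.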
Figure 4
The presence of Ns correlates well with sequencing errors. The presence of an N in the sequence indicates the GS20's inability to accurately call a base at that position within the sequence. The number of other sequencing errors (substitutions, insertions and deletions) within a sequence read correlates with the number of uncalled bases. By removing all reads that contain one or more Ns, the overall sequencing error rate drops substantially.
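The filter described in this caption, discarding any read with one or more uncalled bases, is simple to express. This is a minimal sketch assuming reads are held as plain strings; the function name `drop_reads_with_n` is hypothetical.

```python
def drop_reads_with_n(reads):
    """Return only the reads that contain no uncalled base (N).

    The check is case-insensitive so that lowercase 'n' is also treated
    as an uncalled base; that handling is an assumption of this sketch.
    """
    return [read for read in reads if "N" not in read.upper()]


# Toy data (invented): two clean reads and two with uncalled bases.
toy_reads = ["ACGTTGCA", "ACGNTGCA", "TTGACCA", "nACGT"]
clean_reads = drop_reads_with_n(toy_reads)
print(clean_reads)
```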
