Review
J Proteome Res. 2015 May 1;14(5):1993-2001.
doi: 10.1021/pr501138h. Epub 2015 Apr 22.

Review, evaluation, and discussion of the challenges of missing value imputation for mass spectrometry-based label-free global proteomics

Bobbie-Jo M Webb-Robertson et al. J Proteome Res. 2015.

Abstract

In this review, we apply selected imputation strategies to label-free liquid chromatography-mass spectrometry (LC-MS) proteomics datasets to evaluate their accuracy with respect to metrics of variance and classification. We evaluate several commonly used imputation approaches on their individual merits and discuss the caveats of each approach with respect to the example LC-MS proteomics data. In general, local similarity-based approaches, such as the regularized expectation maximization and least-squares adaptive algorithms, yield the best overall performance with respect to metrics of accuracy and robustness. However, no single algorithm consistently outperforms the remaining approaches, and in some cases, classification without imputation was the most accurate. Thus, because of the complex mechanisms of missing data in proteomics, which also vary from peptide to protein, no individual method is a single solution for imputation. On the basis of the observations in this review, the goal for imputation in the field of computational proteomics should be to develop new approaches that work generically for this data type and new strategies to guide users in the selection of the best imputation for their dataset and analysis objectives.
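
To make the distinction between single-value and local similarity-based imputation concrete, the following is a minimal sketch on a peptides-by-samples matrix with NaN marking missing values. It is an illustration under stated assumptions, not the algorithms evaluated in the review: row-mean filling stands in for the single-value family, and a plain k-nearest-neighbor fill stands in for the local family (it is not the regularized expectation maximization or least-squares adaptive method). The function names `impute_row_mean` and `impute_knn` are hypothetical.

```python
import numpy as np

def impute_row_mean(X):
    """Single-value style fill: replace each missing entry with the mean of
    the observed values in the same row (peptide). Illustrative only."""
    X = np.asarray(X, dtype=float)
    out = X.copy()
    for i in range(X.shape[0]):
        miss = np.isnan(X[i])
        if miss.any() and not miss.all():
            out[i, miss] = np.nanmean(X[i])
    return out

def impute_knn(X, k=2):
    """Local-similarity style fill: for each row with gaps, find the rows most
    similar over the columns both have observed, then average the neighbors'
    values in the missing column. A simple stand-in for the local family."""
    X = np.asarray(X, dtype=float)
    out = X.copy()
    n = X.shape[0]
    for i in range(n):
        miss = np.isnan(X[i])
        if not miss.any():
            continue
        # Root-mean-square distance over columns observed in both rows.
        dists = np.full(n, np.inf)
        for j in range(n):
            if j == i:
                continue
            shared = ~np.isnan(X[i]) & ~np.isnan(X[j])
            if shared.any():
                dists[j] = np.sqrt(np.mean((X[i, shared] - X[j, shared]) ** 2))
        order = np.argsort(dists)
        for c in np.where(miss)[0]:
            # Only rows that are comparable and observed in column c can donate.
            donors = [j for j in order
                      if np.isfinite(dists[j]) and not np.isnan(X[j, c])][:k]
            if donors:
                out[i, c] = X[donors, c].mean()
    return out
```

On a toy matrix where one peptide closely tracks another, the neighbor-based fill borrows the similar peptide's value, whereas the row-mean fill ignores that structure. Entries with no usable donor are left as NaN in this sketch.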

Keywords: Imputation; accuracy; classification; label free; mean-square error; peak intensity.

Figures

Figure 1
Average log10 intensity as measured by peptide peak area in the control group versus fraction of missing values and peptide counts associated with bins corresponding to the fraction of missing data comparing phenotypes and exposures for datasets from (A) human plasma and (B) mouse lung. The control group for the human plasma is the normal glucose tolerant (NGT) samples, and the sham group for the mouse lung is the regular weight mice with no lipopolysaccharide (LPS) exposure. The vertical red line represents median average intensity, and the horizontal red line represents the point that 50% of the values are missing. The red numbers are the fraction of peptides that fall into the four boxes separated by the red lines.
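
The quantities plotted in this figure can be sketched as follows: a per-peptide average log10 intensity, a per-peptide fraction of missing values, and the share of peptides in each of the four regions split at the median intensity and at 50% missing. This is a minimal sketch with assumptions: the input is taken to be a peptides-by-samples array of raw peak areas with NaN for missing values, the average is taken over log10 values (the caption does not say whether the log is applied before or after averaging), and the function names are hypothetical.

```python
import numpy as np

def missingness_summary(intensity):
    """Return (mean log10 intensity, fraction missing) per peptide.
    Peptides with no observed value at all are dropped."""
    intensity = np.asarray(intensity, dtype=float)
    frac_missing = np.isnan(intensity).mean(axis=1)
    keep = frac_missing < 1.0
    mean_log10 = np.array([
        np.log10(row[~np.isnan(row)]).mean() for row in intensity[keep]
    ])
    return mean_log10, frac_missing[keep]

def quadrant_fractions(mean_log10, frac_missing):
    """Fraction of peptides in each region defined by the median average
    intensity (vertical split) and 50% missing values (horizontal split)."""
    high = mean_log10 > np.median(mean_log10)
    sparse = frac_missing > 0.5
    n = len(mean_log10)
    return {
        "low_intensity_mostly_missing": float(np.sum(~high & sparse) / n),
        "high_intensity_mostly_missing": float(np.sum(high & sparse) / n),
        "low_intensity_mostly_observed": float(np.sum(~high & ~sparse) / n),
        "high_intensity_mostly_observed": float(np.sum(high & ~sparse) / n),
    }
```

Peptides lying exactly on a split line are assigned to the lower or more observed side here, which is an arbitrary choice made for the sketch.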
Figure 2
Boxplot of the average log10 CV(RMSE) for the imputed dilution series datasets (Table 1) at the (A) peptide and (B) protein levels. The lower line of the box represents the 25th percentile, the upper line of the box represents the 75th percentile, and the inner line corresponds to the median log10 CV(RMSE).
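
For readers unfamiliar with the metric, CV(RMSE) is commonly defined as the root-mean-square error divided by the mean of the reference values. The sketch below scores imputed entries against held-out known values under that common definition; the exact formula and masking scheme used in the review are not given in this excerpt, so both are assumptions, and `cv_rmse` is a hypothetical helper name.

```python
import numpy as np

def cv_rmse(truth, estimate, mask=None):
    """Coefficient of variation of the RMSE: RMSE / mean(truth).
    `mask` optionally selects the entries that were hidden and then imputed;
    by default all entries are scored."""
    truth = np.asarray(truth, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    if mask is None:
        mask = np.ones(truth.shape, dtype=bool)
    t = truth[mask]
    e = estimate[mask]
    rmse = np.sqrt(np.mean((e - t) ** 2))
    return rmse / np.mean(t)

def log10_cv_rmse(truth, estimate, mask=None):
    """log10 of CV(RMSE), the scale used on the boxplot axis."""
    return np.log10(cv_rmse(truth, estimate, mask))
```

A typical use is to hide a subset of observed intensities, impute them, and pass the hidden positions as `mask`; lower values indicate imputed values closer to the truth relative to the overall intensity level.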
Figure 3
95% confidence intervals of the ranks to compare all imputation algorithms based on classification accuracy for the mouse lung and human plasma data at the peptide level (A, C) and at the protein level (B, D), respectively. Single-value imputation algorithms are colored black, local imputation algorithms are red, and methods that estimate the principal components directly from the data without imputation are blue. Imputation algorithms with no overlap in their confidence intervals are statistically different at α of 0.05, and larger rank is equivalent to larger classification accuracy (shown as percent).
Figure 4
Comparison of each imputation algorithm based on the average rank of imputation algorithms achieved via CV(RMSE) versus the average rank achieved via classification accuracy at the peptide and protein levels. The green line represents the ranking of sppPCA based on classification accuracy since its improvement in variance is unknown given that it does not impute data.
Figure 5
Comparison of the peptide and protein accuracy metrics for (A) dilution, (B) mouse lung, and (C) human plasma datasets.
