LocDB: experimental annotations of localization for Homo sapiens and Arabidopsis thaliana

Abstract

LocDB is a manually curated database with experimental annotations for the subcellular localizations of proteins in Homo sapiens (HS, human) and Arabidopsis thaliana (AT, thale cress). Currently, it contains entries for 19 604 UniProt proteins (HS: 13 342; AT: 6262). Each database entry contains the experimentally derived localization in Gene Ontology (GO) terminology, the experimental annotation of localization, localization predictions by state-of-the-art methods and, where available, the type of experimental information. LocDB is searchable by keyword, protein name and subcellular compartment, as well as by identifiers from UniProt, Ensembl and TAIR resources. In comparison to other public databases, LocDB adds about 10 000 experimental localization annotations for HS proteins and ∼900 for AT proteins. Over 40% of the proteins in LocDB have multiple localization annotations, providing a better platform for the development of new multiple-localization prediction methods with higher coverage and accuracy. Links to all referenced databases are provided. LocDB will be updated regularly by our group (available at: http://www.rostlab.org/services/locDB ).

INTRODUCTION

Proteins are the fundamental functional components of the machinery of life. The particular cellular compartment in which they reside, i.e. their native subcellular localization, is a key feature that characterizes their physiological functions. Many careful, hypothesis-driven experimental studies have contributed to our large body of annotations of cellular compartments ( 1–5 ). Recently, high-throughput experiments have stepped up to the challenge of increasing the number of annotations ( 6–15 ). These data sets capture aspects of protein function and, more generally, of global cellular processes.

UniProt (release 2010_07) ( 16 ) constitutes the most comprehensive and, arguably, the most accurate resource with experimental annotations of subcellular localization. However, even this excellent resource remains incomplete for the proteomes from Homo sapiens (HS) and Arabidopsis thaliana (AT): of the 20 282 human proteins in Swiss-Prot ( 17 ), 14 502 have annotations of localization (72%), but for only 3720 (18%) are these annotations experimental. Similarly, of the 9099 AT proteins only 1495 (16%) have experimental annotations of localization. LocDB builds on UniProtKB and adds specific value to it by collecting information about subcellular localization from the primary literature and from other databases. These data are enriched by annotations, links and predictions.

DATA SET

Curated entries with experimental data

LocDB contains experimental annotations for subcellular localization of 19 604 UniProt proteins; 13 342 of these are from HS [10 102 Swiss-Prot and 3240 TrEMBL ( 17 )] and 6262 from AT (3466 Swiss-Prot, 2796 TrEMBL). This raises the experimental annotations for human from 3720 (18%) to 13 342 (66%), and for thale cress from 1495 (16% of the UniProt subset of AT; note that this subset may constitute as little as 30% of all AT proteins) to 6262 (69% of the UniProt subset of AT). We classify all proteins according to the Gene Ontology ( 18 ) (GO) hierarchy into 12 primary classes of subcellular localization: cytoplasm, endoplasmic reticulum, endosome, extracellular, Golgi apparatus, mitochondrion, nucleus, peroxisome, plasma membrane, plastid, vacuole and vesicles ( Table 1 ). The proteins are further classified into subclasses of these primary classes, denoted as secondary protein localizations. For example, protein RL21_HUMAN is experimentally annotated to be localized in Primary: Nucleus and Secondary: Nucleolus.
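The coverage percentages quoted above can be reproduced with a few lines of arithmetic. The following is only a sketch: the counts are copied from the text, while the dictionary and function names are illustrative, and the denominators are assumed to be the 20 282 human Swiss-Prot proteins and the 9099 AT proteins in UniProt mentioned in the Introduction.

```python
# Counts quoted in the text; all names below are illustrative.
REFERENCE_TOTAL = {"HS": 20282, "AT": 9099}        # assumed denominators
UNIPROT_EXPERIMENTAL = {"HS": 3720, "AT": 1495}    # UniProt 2010_07
LOCDB_EXPERIMENTAL = {"HS": 13342, "AT": 6262}     # LocDB


def coverage_percent(annotated: int, total: int) -> int:
    """Share of proteins with an experimental annotation, in whole percent."""
    return round(100 * annotated / total)


for organism in ("HS", "AT"):
    total = REFERENCE_TOTAL[organism]
    before = coverage_percent(UNIPROT_EXPERIMENTAL[organism], total)
    after = coverage_percent(LOCDB_EXPERIMENTAL[organism], total)
    print(f"{organism}: {before}% -> {after}%")
```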

Table 1. Comparison between different localization annotation resources ^a

Subcellular localization	Homo sapiens			Arabidopsis thaliana
	LocDB	LOCATE	Uniprot (2010_07)	LocDB	SUBA II	Uniprot (2010_07)
Cytoplasm	4787	1054	1194	912	452	161
Endoplasmic reticulum	1027	367	185	292	285	52
Endosome	409	448	65	6	10	16
Extracellular	2266	380	33	188	–	8
Golgi apparatus	909	503	134	179	171	51
Mitochondrion	884	282	151	724	700	164
Nucleus	4560	2705	1181	1104	1031	326
Peroxisome	131	128	21	240	265	23
Plasma membrane	3940	1702	878	1835	3189	449
Plastid incl. chloroplast	–	–	–	2420	1945	267
Vacuole	297	250	34	862	849	35
Vesicles	258	99	34	–	–	1

^a The numbers in columns show the number of experimentally annotated proteins in each subcellular location in the resources LocDB, LOCATE ( 1 ), SUBA ( 4 ) and UniProt (2010_07) release ( 16 ).

Statistics

Each entry in LocDB has some experimental localization data. However, we have explicit annotations of a particular experiment type for only 25% of the entries. This is a work in progress, as curation is tedious and manual; we plan to update the details regarding experiments with every new release of LocDB. Most annotations in LocDB are for the nucleus (20%), cytoplasm (20%) and the plasma membrane (20%). Almost two in three of all HS proteins are annotated in one of the three largest compartments (23% nucleus, 25% cytoplasm, 20% plasma membrane). Similarly, about two in three of the AT proteins fall into one of three compartments [28% plastid (incl. chloroplast), 21% plasma membrane, 13% nucleus]. The distribution of proteins within each region is accessible from the LocDB statistics page http://www.rostlab.org/services/locDB/statistics.php .
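As a cross-check, the percentages above can be recomputed from the LocDB columns of Table 1. The sketch below assumes that each percentage is the compartment count divided by the sum of all compartment counts in the column (i.e. a share of annotations, not of proteins); this reading is our assumption, and the variable and function names are illustrative.

```python
from collections import Counter

# LocDB columns of Table 1; compartments marked with a dash are omitted.
LOCDB_COUNTS = {
    "HS": {
        "Cytoplasm": 4787, "Endoplasmic reticulum": 1027, "Endosome": 409,
        "Extracellular": 2266, "Golgi apparatus": 909, "Mitochondrion": 884,
        "Nucleus": 4560, "Peroxisome": 131, "Plasma membrane": 3940,
        "Vacuole": 297, "Vesicles": 258,
    },
    "AT": {
        "Cytoplasm": 912, "Endoplasmic reticulum": 292, "Endosome": 6,
        "Extracellular": 188, "Golgi apparatus": 179, "Mitochondrion": 724,
        "Nucleus": 1104, "Peroxisome": 240, "Plasma membrane": 1835,
        "Plastid": 2420, "Vacuole": 862,
    },
}


def top_compartments(counts, k=3):
    """Return the k largest compartments with their share in whole percent."""
    total = sum(counts.values())
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return [(name, round(100 * n / total)) for name, n in ranked[:k]]


# Pool both organisms to get the shares over all LocDB annotations.
combined = Counter(LOCDB_COUNTS["HS"]) + Counter(LOCDB_COUNTS["AT"])

print(top_compartments(LOCDB_COUNTS["HS"]))
print(top_compartments(LOCDB_COUNTS["AT"]))
print(top_compartments(combined))
```

Under this assumption the three largest compartments together account for 68% of the HS annotations and 62% of the AT annotations.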

Multiple localizations

Many proteins travel, i.e. they reside in more than one subcellular localization at some point of their ‘life’. Most traditional, detailed biochemical experiments point to one single compartment as the major native environment of each protein ( 19 ). By contrast, high-throughput experiments identify most proteins in more than one compartment. Clearly, high-throughput experiments are noisy. Nevertheless, are noisy large-scale experiments closer to the truth than small-scale approaches? The answer remains unclear. About 40% of the LocDB entries have experimental evidence for more than one localization. This may imply that 60% of all proteins are primarily native to a single compartment. In fact, previous analyses suggest a similar value ( 19 ). However, this does not imply that only 40% of the proteins ever ‘travel through’ more than one compartment: many traveling proteins are likely not captured in the experimental data due to limited coverage and limitations in the experimental resolution (false negatives). On the other hand, some fraction of the 40% of proteins evidenced in several localizations may reflect experimental errors (false positives). It remains unclear how to weigh those effects.

Most proteins unique

LocDB also clusters proteins into families or groups of related proteins ( Figure 1 ). For instance, 1160 (8%) of all HS proteins and 74 (1%) of all AT proteins have more than 98% pairwise sequence identity (PIDE) to another protein in the data set. Clustering at PIDE<25% yields 5587 proteins in HS (42%) and 2744 proteins in AT (47%). Conversely, this implies that 7755 proteins annotated in HS and 3518 in AT are sequence-unique at the 25% PIDE threshold. The percentage of proteins with multiple localizations is higher when considering sequence-unique subsets, e.g. while 40% of all proteins are annotated with multiple localizations, this holds for 4.6% of those clustered at 98% PIDE and 45% of those clustered at 25% PIDE.

Figure 1.

Clustering of LocDB. We clustered the LocDB entries by BLASTclust ( 26 ) to explore whether or not some families are highly over-represented in LocDB, and found that they are not. For instance, 46% of HS and 43% of AT proteins in LocDB have levels of PIDE <25%, i.e. differ substantially in sequence. On the other end of the spectrum, only 8% of HS and 1% of AT proteins are very similar to each other (PIDE >98%). Note that levels of PIDE>70% usually suffice to infer similarity in localization at levels of about 75% ( 31 ), i.e. for over 80% of the LocDB entries no other entry could be used to predict localization by homology.
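To illustrate the kind of grouping involved, the following minimal sketch clusters proteins by single linkage at a given PIDE threshold, which is how BLASTclust is generally described. It is not the BLASTclust implementation: the function name, the protein labels and the pairwise identities are made up for illustration, and real input would come from all-against-all BLAST comparisons.

```python
from collections import defaultdict


def single_linkage_clusters(proteins, pairwise_pide, threshold):
    """Group proteins linked by any chain of pairs with PIDE >= threshold."""
    parent = {p: p for p in proteins}

    def find(p):
        # Follow parent links to the cluster representative (path halving).
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    for (a, b), pide in pairwise_pide.items():
        if pide >= threshold:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a  # merge the two clusters

    clusters = defaultdict(list)
    for p in proteins:
        clusters[find(p)].append(p)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical toy data: four proteins and three pairwise identities (%).
proteins = ["P1", "P2", "P3", "P4"]
pairwise = {("P1", "P2"): 99.0, ("P2", "P3"): 60.0, ("P3", "P4"): 20.0}

clusters_98 = single_linkage_clusters(proteins, pairwise, 98.0)
clusters_25 = single_linkage_clusters(proteins, pairwise, 25.0)
print(clusters_98)  # P1 and P2 merge; P3 and P4 stay alone
print(clusters_25)  # P1, P2 and P3 merge; P4 stays alone
```

Lowering the threshold can only merge clusters, so the number of clusters never increases as the PIDE cut-off is relaxed.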

Experimental and predicted localization

Each LocDB entry corresponds to one protein, and contains protein identifiers, experimental annotations of protein localization, types of experiments performed and the respective PubMed ( 20 ) identifiers of the publications, as well as predicted localization annotations from LOCtree ( 19 ), WoLF PSORT ( 21 ), MultiLoc ( 22 ), TargetP ( 23 ), PredictNLS ( 24 ) and NucPred ( 25 ). Prediction results are given in both basic and detailed formats along with the respective reliability and probability scores ( Figure 2 ).
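The content of an entry described above can be pictured as a simple record. The sketch below is purely illustrative: the class and field names are our own and do not reflect the actual LocDB schema. The only data used is the RL21_HUMAN annotation mentioned earlier (Primary: Nucleus, Secondary: Nucleolus); all other fields are left empty rather than filled with invented values.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class Localization:
    primary: str                      # one of the 12 primary classes
    secondary: Optional[str] = None   # optional subclass of the primary class


@dataclass
class LocDBEntry:
    # Hypothetical field names; not the real database schema.
    protein_name: str
    experimental: List[Localization] = field(default_factory=list)
    experiment_types: List[str] = field(default_factory=list)
    pubmed_ids: List[str] = field(default_factory=list)
    predictions: Dict[str, str] = field(default_factory=dict)  # method -> compartment

    def primary_compartments(self) -> Set[str]:
        """Distinct primary classes with experimental evidence."""
        return {loc.primary for loc in self.experimental}

    def has_multiple_localizations(self) -> bool:
        """True if experiments support more than one primary class."""
        return len(self.primary_compartments()) > 1


entry = LocDBEntry(
    protein_name="RL21_HUMAN",
    experimental=[Localization(primary="Nucleus", secondary="Nucleolus")],
)
print(entry.primary_compartments(), entry.has_multiple_localizations())
```

Counting multiple localizations on primary classes, as done here, is one possible convention; counting distinct secondary classes would give a different fraction.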

Figure 2.

Example screen dump from LocDB. The example shows a search with the protein CIPKN_ARATH. Arrows highlight input, output and the distinction between different aspects of the output.

Data mining from primary literature

Data for LocDB are collected from reports of many low- and high-throughput experiments. Citations to the appropriate experiments are displayed on the LocDB protein entry pages. Protein sequences and identifiers from the experimental papers are extracted and BLASTed ( 26 ) against UniProt. The sequences with ≥98% PIDE over the entire sequence are assigned UniProt and Ensembl ( 27 ) identifiers for HS and TAIR ( 28 ) identifiers for AT.
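The identifier assignment step can be sketched as a filter over BLAST hits. The code below is a sketch under stated assumptions, not the pipeline used for LocDB: the `BlastHit` record, its field names and the example identifiers are hypothetical, and "over the entire sequence" is interpreted here as an alignment that spans the full length of both query and subject.

```python
from typing import List, NamedTuple, Optional


class BlastHit(NamedTuple):
    # Hypothetical record for one query/subject alignment.
    query_id: str
    subject_id: str
    pide: float        # percentage identity over the aligned region
    align_len: int     # alignment length in residues
    query_len: int
    subject_len: int


def covers_entire_sequence(hit: BlastHit) -> bool:
    """Assumed reading: the alignment spans both full sequences."""
    return hit.align_len >= hit.query_len and hit.align_len >= hit.subject_len


def assign_identifier(hits: List[BlastHit], min_pide: float = 98.0) -> Optional[str]:
    """Return the best full-length subject with PIDE >= min_pide, or None."""
    candidates = [
        h for h in hits if h.pide >= min_pide and covers_entire_sequence(h)
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda h: h.pide).subject_id


# Made-up hits for one query sequence.
hits = [
    BlastHit("q1", "SUBJ_A", 99.2, 300, 300, 300),  # full length, passes
    BlastHit("q1", "SUBJ_B", 99.9, 120, 300, 310),  # partial alignment
    BlastHit("q1", "SUBJ_C", 97.0, 300, 300, 300),  # below 98% PIDE
]
print(assign_identifier(hits))
```

Sequences without any qualifying hit return `None` and would need manual inspection in such a scheme.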

Data mining from external databases

Data are also mined from external databases, e.g. LOCATE ( 1 ), SUBA ( 4 ) and many other resources. For every entry, LocDB reports all references, each linked directly to its PubMed ( 20 ) abstract.

Comparison with other resources

Many excellent subcellular localization resources with experimental annotations of proteins are available for HS and AT, such as LOCATE ( 1 ) for HS and SUBA ( 4 ) for AT. The comparison and overlap between these resources and UniProt release (2010_07) are shown in Figure 3 a and b. In addition, the number of proteins annotated in the various compartments in these resources is compared in Table 1 . These comparisons show that we have added ∼10 000 human protein localization annotations and ∼900 Arabidopsis protein localization annotations over LOCATE, SUBA and UniProt.

Figure 3.

Comparison between LocDB, UniProt, LOCATE and SUBA for experimental annotations of protein subcellular localizations. ( a ) The Venn diagram shows that LocDB has added annotations for 9469 HS proteins, not annotated in UniProt (2010_07) release ( 16 ) and LOCATE ( 1 ). ( b ) The Venn diagram shows that LocDB has added annotations for 827 AT proteins, not annotated in UniProt (2010_07) release ( 16 ) and SUBA ( 4 ).
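The counts of newly annotated proteins in such a Venn diagram correspond to a set difference: proteins in LocDB that appear in none of the other resources. A minimal sketch follows, assuming each resource is reduced to a set of protein identifiers; the function name and the identifiers are hypothetical placeholders, not real accessions.

```python
def newly_annotated(locdb, *other_resources):
    """Proteins annotated in LocDB but in none of the other resources."""
    covered = set().union(*other_resources)
    return set(locdb) - covered


# Hypothetical identifier sets for illustration only.
locdb_ids = {"prot1", "prot2", "prot3", "prot4", "prot5"}
uniprot_ids = {"prot1", "prot2"}
locate_ids = {"prot2", "prot3", "prot9"}

added = newly_annotated(locdb_ids, uniprot_ids, locate_ids)
print(sorted(added))
```

For the comparison to be meaningful, all resources must first be mapped to the same identifier space, e.g. UniProt identifiers.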

As mentioned above, the UniProt database contains both experimental and general annotations, such as ‘Probable’, ‘By similarity’ and ‘Potential’, for protein subcellular locations. The level of discrepancy in the annotations is very high for locations involved in the secretory pathway, such as the Golgi apparatus and the endoplasmic reticulum, especially for human proteins (shown in Figure 1 a and b in Supplementary Data ). In Arabidopsis, the discrepancy is high in all compartments except nucleus and plastid. We also compared LocDB with the databases DBSubLoc ( 29 ) and eSLDB ( 30 ); these comparisons are not shown because the annotations in these resources are mostly derived from the Swiss-Prot database.

LocDB will be updated once every 3 months. Users can contribute to the resource by adding information on the contribution page of the website, or by sending an email to locdb@rostlab.org if they come across any inaccuracies. We will use the database as a portal to access state-of-the-art prediction methods, which will enable users and developers to test prediction methods. We will also add predictions for proteins without experimental annotations; these will be clearly marked as predictions. More eukaryotic and prokaryotic proteomes, such as Escherichia coli and yeast, will become available through the database in the future. Moreover, we plan to add curated protein expression data and protein–protein interaction data in future versions of LocDB.

Availability

LocDB data can be retrieved as individual entries or downloaded as HTML and text files from http://www.rostlab.org/services/locDB . The database is a MySQL database and can be obtained upon request ( locdb@rostlab.org ) as an SQL file.

FUNDING

Funding for open access charge: The National Institute of General Medical Sciences (NIGMS; grant R01-GM079767) at the National Institutes of Health (NIH).

Conflict of interest statement. None declared.

ACKNOWLEDGEMENTS

We are pleased to thank Amos Bairoch (SIB, Geneva), Rolf Apweiler (EBI, Hinxton), Phil Bourne (San Diego University) and their crew for maintaining excellent databases. Furthermore, thanks to all experimentalists who enabled this analysis by making their data publicly available.

REFERENCES

1. Sprenger, Lynn Fink, Karunaratne, Hanson, Hamilton, Teasdale. LOCATE: a mammalian protein subcellular localization database. Nucleic Acids Res., 2008 (pg. D230–D233).

2. Elstner, Andreoli, Klopstock, Meitinger, Prokisch. The mitochondrial proteome database: MitoP2. Methods Enzymol., 2009, vol. 457.

3. Keshava Prasad, Goel, Kandasamy, Keerthikumar, Kumar, Mathivanan, Telikicherla, Raju, Shafreen, Venugopal, et al. Human Protein Reference Database – 2009 update. Nucleic Acids Res., 2009 (pg. D767–D772).

4. Heazlewood, Verboom, Tonti-Filippini, Small, Millar. SUBA: the Arabidopsis subcellular database. Nucleic Acids Res., 2007 (pg. D213–D218).

5. Dellaire, Farrall, Bickmore. The Nuclear Protein Database (NPD): sub-nuclear localisation and functional annotation of the nuclear proteome. Nucleic Acids Res., 2003 (pg. 328–330).

6. Dunkley, Hester, Shadforth, Runions, Weimar, Hanton, Griffin, Bessant, Brandizzi, Hawes, et al. Mapping the Arabidopsis organelle proteome. Proc. Natl Acad. Sci. USA, 2006, vol. 103 (pg. 6518–6523).

7. Benschop, Mohammed, O'Flaherty, Heck, Slijper, Menke. Quantitative phosphoproteomics of early elicitor signaling in Arabidopsis. Mol. Cell. Proteomics, 2007 (pg. 1198–1214).

8. Zybailov, Rutschow, Friso, Rudella, Emanuelsson, Sun, van Wijk. Sorting signals, N-terminal modifications and abundance of the chloroplast proteome. PLoS ONE, 2008, e1994.

9. Jaquinod, Villiers, Kieffer-Jaquinod, Hugouvieux, Bruley, Garin, Bourguignon. A proteomics dissection of Arabidopsis thaliana vacuoles isolated from cell culture. Mol. Cell. Proteomics, 2007 (pg. 394–412).

10. Marmagne, Ferro, Meinnel, Bruley, Kuhn, Garin, Barbier-Brygoo, Ephritikhine. A high content in lipid-modified peripheral proteins and integral receptor kinases features in the Arabidopsis plasma membrane proteome. Mol. Cell. Proteomics, 2007 (pg. 1980–1996).

11. Anderson, Polanski, Pieper, Gatlin, Tirumalai, Conrads, Veenstra, Adkins, Pounds, Fagan, et al. The human plasma proteome: a nonredundant list developed by combination of four separate sources. Mol. Cell. Proteomics, 2004 (pg. 311–326).

12. Calvo, Jain, Xie, Sheth, Chang, Goldberger, Spinazzola, Zeviani, Carr, Mootha. Systematic identification of human mitochondrial disease genes through integrative genomics. Nat. Genet., 2006 (pg. 576–582).

13. Leung, Trinkle-Mulcahy, Lam, Andersen, Mann, Lamond. NOPdb: Nucleolar Proteome Database. Nucleic Acids Res., 2006 (pg. D218–D220).

14. Sheng, Chen, Van Eyk. Multidimensional liquid chromatography separation of intact proteins by chromatographic focusing and reversed phase of the human serum proteome: optimization and protein database. Mol. Cell. Proteomics, 2006.

15. Gassmann, Henzing, Earnshaw. Novel components of human mitotic chromosomes identified by proteomic analysis of the chromosome scaffold fraction. Chromosoma, 2005, vol. 113 (pg. 385–397).

16. The UniProt Consortium. The Universal Protein Resource (UniProt) 2009. Nucleic Acids Res., 2009 (pg. D169–D174).

17. Boeckmann, Bairoch, Apweiler, Blatter, Estreicher, Gasteiger, Martin, Michoud, O'Donovan, Phan, et al. The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res., 2003 (pg. 365–370).

18. Ashburner, Ball, Blake, Botstein, Butler, Cherry, Davis, Dolinski, Dwight, Eppig, et al. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat. Genet., 2000.

19. Nair, Rost. Mimicking cellular sorting improves prediction of subcellular localization. J. Mol. Biol., 2005, vol. 348 (pg. 100).

20. NLM. Free Web-based access to NLM databases. NLM Tech. Bull., 1997, vol. 296.

21. Horton, Park, Obayashi, Fujita, Harada, Adams-Collier, Nakai. WoLF PSORT: protein localization predictor. Nucleic Acids Res., 2007 (pg. W585–W587).

22. Hoglund, Donnes, Blum, Adolph, Kohlbacher. MultiLoc: prediction of protein subcellular localization using N-terminal targeting sequences, sequence motifs and amino acid composition. Bioinformatics, 2006 (pg. 1158–1165).

23. Emanuelsson, Brunak, von Heijne, Nielsen. Locating proteins in the cell using TargetP, SignalP and related tools. Nat. Protoc., 2007 (pg. 953–971).

24. Cokol, Nair, Rost. Finding nuclear localization signals. EMBO Rep., 2000 (pg. 411–415).

25. Brameier, Krings, MacCallum. NucPred – predicting nuclear localization of proteins. Bioinformatics, 2007 (pg. 1159–1160).

26. Altschul, Madden, Schaffer, Zhang, Miller, Lipman. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 1997 (pg. 3389–3402).

27. Hubbard, Barker, Birney, Cameron, Chen, Clark, Cox, Cuff, Curwen, Down, et al. The Ensembl genome database project. Nucleic Acids Res., 2002.

28. Garcia-Hernandez, Berardini, Chen, Crist, Doyle, Huala, Knee, Lambrecht, Miller, Mueller, et al. TAIR: a resource for integrated Arabidopsis data. Funct. Integr. Genomics, 2002 (pg. 239–253).

29. Guo, Hua, Sun. DBSubLoc: database of protein subcellular localization. Nucleic Acids Res., 2004 (pg. D122–D124).

30. Pierleoni, Martelli, Fariselli, Casadio. eSLDB: eukaryotic subcellular localization database. Nucleic Acids Res., 2007 (pg. D208–D212).

31. Nair, Rost. Sequence conserved for subcellular localization. Protein Sci., 2002 (pg. 2836–2847).

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.5), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
