PSORTdb
A database of protein subcellular localizations for bacteria

Home - Documentation - Resources - Contact - Updates

PSORTdb documentation

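1. Introduction
2. ePSORTdb
3. cPSORTdb
 3.1. PSORTb v.1.1.2
 3.2. PSORTb v.2.0
4. Accessing ePSORTdb and cPSORTdb
 4.1. Text search
 4.2. BLAST search
 4.3. Browse datasets
 4.4. Result display
 4.5. Result download
5. ePSORTdb fields
6. cPSORTdb fields
 6.1. cPSORTdb general fields
 6.2. PSORTb v.1.1.2 & v.2.0 prediction fields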

 

1. Introduction

Identifying a bacterial protein’s subcellular localization (SCL) provides valuable clues about its biological function. For example, surface-exposed or secreted proteins are of primary interest because of their potential as vaccine candidates or diagnostic agents (environmental or medical) and because they are readily accessible to drugs. Computational SCL analysis of the growing number of completed bacterial genomes, or of individual proteins, allows researchers to screen for vaccine and drug candidates, automatically annotate gene products, or select proteins for further study.

PSORTdb is a database of SCL for bacteria that contains both information determined through laboratory experimentation (ePSORTdb dataset) and computational predictions (cPSORTdb dataset). The dataset of experimentally verified information (~2000 proteins) was manually curated by us and represents the largest dataset of its kind. The second component of this database contains computational analyses of proteins deduced from the most recent NCBI dataset of completely sequenced genomes. Analyses are currently calculated using PSORTb, the most precise automated SCL predictor for bacterial proteins.

PSORTdb belongs to the PSORT family of resources, which provides tools and links for subcellular localization prediction.

If you use PSORTdb in your research, we would greatly appreciate it if you cited the following publication:

  • PSORT-DB: Rey, S., M. Acab, J.L. Gardy, M.R. Laird, K. deFays, C. Lambert, and F.S.L. Brinkman (2005). PSORT-DB: A Database of Subcellular Localizations for Bacteria. Nucleic Acids Research. 33:D164-168. (Database issue)


 

2. ePSORTdb

ePSORTdb is a dataset of proteins whose subcellular localization (SCL) has been verified by laboratory experimentation. This dataset was used to train PSORTb, our automated SCL predictor for bacterial proteins, and is currently the largest dataset of its kind available for bacteria. Localizations were manually assigned to each protein in the dataset by searching the literature (through the PubMed database or other literature sources, such as microbiology textbooks) for experimentally derived information.

The following table describes the subcellular localization sites used to annotate proteins in ePSORTdb.

 
                 Gram-negative                Gram-positive
Single SCL       Cytoplasmic (C)              Cytoplasmic (C)
                 Cytoplasmic membrane (CM)    Cytoplasmic membrane (CM)
                 Periplasmic (P)              -
                 -                            Cell wall (CW)
                 Outer membrane (OM)          -
                 Extracellular (E)            Extracellular (E)

Multiple SCL     C/CM                         C/CM
                 CM/P                         CM/CW
                 P/OM
                 OM/E

As of August 2004, ePSORTdb version 2.0 includes 2165 bacterial proteins (1591 from Gram-negative and 574 from Gram-positive bacteria).

ePSORTdb can be searched using a variety of fields. The specific fields available in ePSORTdb are described below.


 

3. cPSORTdb

cPSORTdb is a dataset of protein localizations predicted by computational methods. To date, 140 bacterial genomes available through NCBI have been analyzed with both PSORTb version 1.1.2 and version 2.0. In the future, we would like to incorporate the results of other SCL prediction methods into cPSORTdb.

The long format predictions generated by PSORTb versions 1.1.2 and 2.0 methods are stored in cPSORTdb and are fully browsable and searchable. Descriptions of the two versions of PSORTb are given below.

3.1. PSORTb v.1.1.2

PSORT-B v.1.1.2 is designed for Gram-negative bacterial proteins and consists of six analytical modules:

  • SCL-BLAST, or SubCellular Localization BLAST
  • Motif Analysis
  • Outer Membrane Motif Analysis
  • HMMTOP
  • SubLocC
  • Signal Peptide

Each module analyzes one biological feature known to influence or be characteristic of subcellular localization. The modules may act as a binary predictor, classifying a protein as either belonging or not belonging to a particular localization site, or they may be multi-category, able to assign a protein to one of several localization sites.

In order to generate a final prediction, the results of each module are combined and assessed. A probabilistic method and 5-fold cross validation were used to assess the likelihood of a protein being at a specific localization given the prediction of a certain module. These likelihoods are used to generate a probability value for each of the five localization sites for a user's query protein.

When analyzing a Gram-negative organism, the 5 possible localization sites are: cytoplasm, cytoplasmic membrane, periplasm, outer membrane and extracellular space. PSORT-B v.1.1.2 returns a list of these five localization sites and the associated probability value for each, ranked in descending order. A cutoff of 7.5 or above is used to return a final prediction, otherwise a result of "Unknown" is returned.
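The ranking and cutoff step described above can be sketched in a few lines of Python. This is only an illustration, not PSORTb's own code: the function name, the dictionary layout and the example scores are invented, and only the cutoff value of 7.5 comes from the text.

```python
# Illustrative sketch only; not part of PSORTb.
CUTOFF = 7.5  # cutoff stated in the documentation


def final_prediction(scores, cutoff=CUTOFF):
    """Rank localization sites by score and apply the cutoff.

    scores: dict mapping a localization site name to its score.
    Returns (ranked, prediction), where ranked is a list of
    (site, score) pairs in descending order of score.
    """
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    top_site, top_score = ranked[0]
    # A site is reported only if its score reaches the cutoff.
    prediction = top_site if top_score >= cutoff else "Unknown"
    return ranked, prediction


# Invented example scores for a Gram-negative protein.
example = {
    "cytoplasm": 0.2,
    "cytoplasmic membrane": 0.3,
    "periplasm": 0.6,
    "outer membrane": 8.7,
    "extracellular space": 0.2,
}
ranked, prediction = final_prediction(example)
```

With these invented scores the top-ranked site clears the cutoff, so it is returned as the prediction; if no site reached 7.5 the result would be "Unknown".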

See below for database fields associated with PSORTb v.1.1.2, or read more information about PSORTb v.1.1.2 modules.

Submit a protein to PSORTb v.1.1.2.


3.2. PSORTb v.2.0

PSORTb v.2.0 is designed for both Gram-negative and Gram-positive bacterial proteins and again consists of multiple analytical modules:

  • SCL-BLAST & SCL-BLASTe, or SubCellular Localization BLAST
  • Support Vector Machines (SVMs)
  • Motif & Profile Analysis
  • Outer Membrane Motif Analysis
  • HMMTOP
  • Signal Peptide

The same integration and weighting process as PSORTb v.1.1.2 (see above) is used to generate a final prediction. For Gram-negative organisms, the 5 possible localization sites are: cytoplasm, cytoplasmic membrane, periplasm, outer membrane and extracellular space. For Gram-positive bacteria, 4 localization sites are possible: cytoplasm, cytoplasmic membrane, cell wall and extracellular space. The 7.5+ cutoff is again used.
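As a small illustration of how the set of localization sites depends on the Gram class, the following Python sketch stores the two site lists named above and checks that a set of scores covers exactly the expected sites. The names SITES_BY_GRAM and check_sites are hypothetical and do not come from PSORTb.

```python
# Illustrative sketch only; names are hypothetical.
SITES_BY_GRAM = {
    "negative": (
        "cytoplasm",
        "cytoplasmic membrane",
        "periplasm",
        "outer membrane",
        "extracellular space",
    ),
    "positive": (
        "cytoplasm",
        "cytoplasmic membrane",
        "cell wall",
        "extracellular space",
    ),
}


def check_sites(gram, scores):
    """Return True if scores has exactly the sites expected for this Gram class.

    Raises ValueError listing the missing and unexpected sites otherwise.
    """
    expected = set(SITES_BY_GRAM[gram])
    found = set(scores)
    if found != expected:
        missing = sorted(expected - found)
        extra = sorted(found - expected)
        raise ValueError(f"missing sites: {missing}; unexpected sites: {extra}")
    return True
```

For example, a score set containing "periplasm" would be rejected for a Gram-positive organism, since that site is defined only for Gram-negative bacteria.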

See below for database fields associated with PSORTb v.2.0, or read more information about PSORTb v.2.0 modules.

Submit a protein to PSORTb v.2.0.

 


4. Accessing ePSORTdb and cPSORTdb

4.1. Text search

A. Simple search input text field - allows user to enter keywords.
B. PSORTb version check box (only available in cPSORTdb) - selects version of PSORTb predictions.
C. Drop down menu - selects a field to search against.
D. Advanced search input text field - allows user to enter keywords.
E. Help message box - displays description and/or example of what type of text can be entered for the selected search field
F. Boolean operator - allows user to make complex queries.

 

Simple search

Simple search (A) allows the user to perform keyword searches against all text fields of the database. The wildcard % (representing zero or more characters) is added between keywords.
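To make the effect of this rule concrete, here is a minimal Python sketch of how a simple-search query could be turned into a wildcard pattern. The function name is hypothetical, and the sketch covers only what is stated above: whether the server also adds wildcards before the first or after the last keyword is not documented here, so none are added.

```python
# Illustrative sketch only; not the actual PSORTdb search code.
def simple_search_pattern(query):
    """Join whitespace-separated keywords with the % wildcard."""
    keywords = query.split()
    return "%".join(keywords)


pattern = simple_search_pattern("antigen 85-A precursor")
```

Here the three keywords are combined into the single pattern antigen%85-A%precursor, so any text may appear between them.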

Simple searches can be carried out against ePSORTdb and cPSORTdb. In the latter case, the search is carried out by default against the predictions from PSORTb v.2.0 (the most recent version of PSORTb).

Advanced search

Advanced searches are performed by specifying keywords in one or more text fields. Available fields vary according to the database in question (follow the links for a list and description of the fields available in ePSORTdb and cPSORTdb). When you choose a field from the drop down menu (C), an example of appropriate values for that field is automatically displayed in the help message box (E) on the far right, showing the correct keywords or syntax to use. You can then either highlight keywords in the help message box and use the double arrow button to send them to the input text box (D), or type your own keywords into box D. Each input text box (D) can hold more than one keyword; these are combined with OR when searching within the selected field. To search for keywords in different fields, use the Boolean operators (F) to combine up to 4 fields.

In cPSORTdb advanced searches, a check box (B) allows the user to choose among predictions from the different program versions available (currently PSORTb versions 1.1.2 and 2.0).

Localization score fields (e.g. "PSORTb predicted SCL score" or "cytoplasmic score") are grouped under the category field called >All scores. If you select this particular field, the query will be run against all localization score fields grouped under this category.

In the same manner, reference fields (PubMed ID, Title, ISBN number, WWW and comments) are also grouped under the category field called >Reference summary.

Furthermore:

  • Keywords are case insensitive
  • Spaces and quotation marks will be included in the search.
  • % represents zero or more characters.
  • _ (underscore character) represents a single wildcard character.
  • NB: wildcards are not automatically added to advanced queries; every keyword is matched exactly. To perform a substring search within a text field, add wildcards (% or _) to your query. Entering _% as the only keyword returns only entries that have some data in the selected field (this works only for text fields; the help message box does not always indicate the field type, but numerical fields are explicitly marked). Numerical fields are queried using comparison operators such as <, >, >=, <= or =. Some examples are shown in the following table.

    Field                  Entered value    Search results
    Protein name           synthetase       no results
    Protein name           synthetase%      all proteins whose name starts with synthetase (2 hits)
    Protein name           %synthetase%     all proteins whose name contains the keyword synthetase (5,724 hits)
    GI                     <=19052904       all proteins whose GI is less than or equal to 19052904
    HMMTOP helices count   8<=range<=10     all proteins with between 8 and 10 helices predicted by the HMMTOP module
    Gene name              yxb_             all proteins whose gene name has 4 characters and starts with yxb
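    The wildcard rules for text fields can be illustrated with a short Python sketch that translates a keyword containing % and _ into a regular expression and matches it against the whole field value, ignoring case. This is a sketch under the rules stated above, not the database's own implementation; the function names are hypothetical.

```python
# Illustrative sketch only; not the actual PSORTdb search code.
import re


def wildcard_to_regex(pattern):
    """Translate a keyword with % and _ wildcards into a compiled regex."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")  # zero or more characters
        elif ch == "_":
            parts.append(".")   # exactly one character
        else:
            parts.append(re.escape(ch))  # any other character is literal
    return re.compile("".join(parts), re.IGNORECASE | re.DOTALL)


def matches(value, pattern):
    """Exact, case-insensitive match of the whole value against the pattern."""
    return wildcard_to_regex(pattern).fullmatch(value) is not None
```

    With these definitions, the keyword synthetase does not match a name such as "Glutamine synthetase", whereas %synthetase% does, and _% matches any non-empty value.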

    Results of a text search are returned as a table. For an explanation of this table, see 4.4 Result Display.


    4.2. BLAST search

    Both ePSORTdb and cPSORTdb datasets can be searched using the BLASTP program. One or more proteins can be submitted at the same time; these must be in FASTA format. A sequence in a FASTA file consists of three parts:

  • A title line, which must begin with a '>' symbol and may be followed by any text
  • A newline character at the end of the title line
  • The sequence itself, which continues until the end of the file or the next '>' is reached

    An example of FASTA format is shown below:

    >sp|O52956|A85A_MYCAV Antigen 85-A precursor (85A)
    MTLVDRLRGAVAGMPRRLVVGAAGAALLSGLIGAVGGSATAGAFSRPGLPVEYLQVPSAAMGRD
    IKVQFQSGGANSPALYLLDGMRAQDDFNGWDINTPAFEWYNQSGISVAMPVGGQSSFYSDWY
    KPACGKAGCTTYKWETFLTSELPQYLSAQKQVKPTGSGVVGLSMAGSSALILAAYHPDQFVYAG
    SLSALLDPSQGMGPSLIGLAMGDAGGYKAADMWGPKEDPAWARNDPSLQVGKLVANNTRIWV
    YCGNGKPSDLGGDNLPAKFLEGFVRTSNLKFQDAYNGAGGHNAVWNFDANGTHDWPYWGA
    QLQAMKPDLQSVLGATPGAGPATAAATNAGNGQGT

    For more information, see the description at NCBI.
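    For readers who want to check their input before submitting it, the three-part structure described above can be parsed with a few lines of Python. This is a minimal sketch, not the parser used by the server; the function name is hypothetical, and dropping whitespace inside sequence lines is an assumption made here for convenience.

```python
# Illustrative sketch only; not the parser used by PSORTdb.
def parse_fasta(text):
    """Return a list of (title, sequence) tuples from FASTA-formatted text."""
    records = []
    title, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new title line closes the previous record, if any.
            if title is not None:
                records.append((title, "".join(chunks)))
            title, chunks = line[1:], []
        elif title is not None:
            # Sequence line: remove any embedded whitespace.
            chunks.append("".join(line.split()))
    if title is not None:
        records.append((title, "".join(chunks)))
    return records


records = parse_fasta(">p1 first protein\nMTLV DRL\nRGA\n>p2\nMKL\n")
```

    Each record keeps the text after the '>' as its title and the concatenated sequence lines as its sequence.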

    Results of a BLAST search are presented through the standard BLASTP layout, displaying the retrieved proteins with their associated parameters. The different sections of the result page are briefly described here:


    Query= sp|O52956|A85A_MYCAV Antigen 85-A precursor (85A) (347 letters)

    Header of the submitted protein with its sequence length in brackets.

     


    Database: ePSORT.faa 2165 sequences; 1,017,765 total letters

    Name and size of the database which was searched (ePSORT.faa for ePSORTdb and cPSORT.faa for cPSORTdb).

     


    Sequences producing significant alignments:                      Score (bits)   E Value

    gi|13431272 Extracellular|Antigen 85-A precursor|Mycobacter...   445            e-126
    ...
    gi|13475694 OuterMembrane|general secretion proteinD|Mesorh...   24             8.8

    Summary of the retrieved proteins with their Score and E-value. Two links are available for each protein: one (e.g. gi|13431272) points to the complete PSORTdb entry of the protein, and the other (e.g. 445) jumps further down the page to the detailed results of the BLAST search. The definition line of each retrieved protein contains its GI number, its subcellular localization (experimentally confirmed for ePSORTdb and predicted for cPSORTdb), its name and its source organism.
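    The layout of the definition line can be unpacked programmatically. The Python sketch below splits a full (untruncated) definition line into the four parts listed above. It assumes the layout seen in the examples on this page, "gi|<GI> <localization>|<name>|<organism>"; the function name and the dictionary keys are hypothetical.

```python
# Illustrative sketch only; assumes the layout
# "gi|<GI> <localization>|<name>|<organism>" seen in the examples.
def parse_hit_definition(defline):
    """Split a BLAST hit definition line into its four documented parts."""
    prefix, rest = defline.split("|", 1)
    if prefix != "gi":
        raise ValueError("definition line does not start with 'gi|'")
    gi_text, rest = rest.split(" ", 1)
    # Any further '|' characters end up in the organism part.
    localization, name, organism = rest.split("|", 2)
    return {
        "gi": int(gi_text),
        "localization": localization,
        "name": name,
        "organism": organism,
    }


hit = parse_hit_definition(
    "gi|13431271 Extracellular|Antigen 85-A precursor|Mycobacterium gordonae"
)
```

    Note that the summary table truncates long definition lines with "...", so this only works on the full lines shown in the detailed results.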

     


    >>[ePSORTdb] [NCBI] gi|13431271 Extracellular|Antigen 85-A
      precursor|Mycobacterium gordonae
     
    Length = 339
    Score = 403 bits (1036), Expect = e-114
    Identities = 197/240 (82%), Positives = 204/240 (85%), Gaps = 1/240 (0%)
    Query: 1 MTLVDRLRGAVAGMPRRXXXXXXXXXXXXXXXXXXXXXXTAGAFSRPGLPVEYLQVPSAA 60
      M LVDR RGAV GMPRR                      TA AFSRPGLPVEYLQVPSAA
    Sbjct: 1 MKLVDRFRGAVTGMPRRLMVGAVGAALLSGLVGFVGGSATASAFSRPGLPVEYLQVPSAA 60
    ...  

    Detailed results of the BLAST search for a retrieved protein are shown above. Two links are available for each protein: the first (called [ePSORTdb] or [cPSORTdb]) allows the user to view the protein's complete entry in PSORTdb, and the second (called [NCBI]) points to the protein's entry in the NCBI database.


    4.3. Browse datasets

    A. Current level of the browsing - allows user to return to a higher level of the database.
    B. Next available levels - allows user to proceed to a lower level of the database.

    Both datasets can be browsed by SCL, organism, phylum, class, Gram stain and genome in every possible combination. When "List all proteins at Current Level" is selected, the results are returned as a table. For an explanation of this table, see 4.4 Result Display.

    As an example, the browse function can be used in the cPSORTdb dataset to retrieve all predicted localizations for a specific genome by selecting "Organism" at the first level, then selecting your favorite microbe, and then choosing "List all proteins at Current Level". From the results page, all the information associated with this organism can be downloaded in either tab-delimited or FASTA format.


    4.4. Result display

    A. Sort columns - sorts by ascending or descending order (up to 3 fields can be used in sorting).
    B. Displayed columns - adds or removes selected columns from the results view.
    C. Results per page - changes the number of records per page.
    D. Download options - downloads the results.
    E. Total number of results and pages - navigates from one result page to another.
    F. Results view - displays search results.

    Results of browsing and keyword searches can be viewed page by page. Initially a default set of fields is displayed as columns, but numerous options allow users to customize their result listing:

    • Up to 3 user-selected fields can be sorted simultaneously in ascending or descending order (A).
    • The user can select which columns to view (B).
    • The number of records per page can be changed from the default of 10 (C).

    Furthermore, from the results list the user can:

    • Download the results list (D). See also 4.5 Result download.
    • See the number of pages in the Results list, and jump to another page (E).
    • Click on the protein's GI accession number in the Results view (F) to get the detailed annotation for each entry (information in all available fields as well as the amino acid sequence of the protein).

    4.5. Result download

     

    From the Results view page, the user can download their search results (D). Results may be downloaded in two different formats:

  • A Tab delimited text file containing those fields currently displayed in Results view.
  • A FASTA file containing the amino acid sequences of the proteins in the results list.
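    A downloaded tab-delimited file can be processed with standard tools. The Python sketch below counts proteins per value of one column. It assumes that the file starts with a header row whose labels match the displayed field names, which is not confirmed here; the function name and the sample text are for illustration only (the two GI numbers and localizations are taken from the BLAST example in section 4.2).

```python
# Illustrative sketch only; assumes the download has a header row.
import csv
import io
from collections import Counter


def count_by_column(tsv_text, column):
    """Count rows of tab-delimited text per value of the given column."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return Counter(row[column] for row in reader)


sample = (
    "GI\tProtein name\tExperimental SCL (terse)\n"
    "13431272\tAntigen 85-A precursor\tExtracellular\n"
    "13475694\tgeneral secretion proteinD\tOuterMembrane\n"
)
counts = count_by_column(sample, "Experimental SCL (terse)")
```

    For a real download, read the file contents into a string (or pass an open file to csv.DictReader directly) and use the column label as it appears in the first line of the file.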


     

    5. ePSORTdb fields

    The following table lists all available ePSORTdb fields. Default fields displayed in results pages are in bold. Numerical fields are followed by a #.

    Field name                   Description
    GI #                         NCBI GI number, primary identification key, unique to each protein
    Swiss-Prot ID                Swiss-Prot primary accession number of the protein
    Protein name                 The name of the protein
    Alternate protein name       Alternative names of the protein
    Gene name                    The name of the gene
    Alternate gene name          Alternative names of the gene
    Organism                     Genus and species of the protein source organism
    Taxonomy ID #                NCBI taxonomy identifier of the source organism
    Gram stain                   Gram classification of the source organism
    Amino acid sequence          Amino acid sequence of the protein
    Sequence length              Number of amino acids in the protein sequence
    Experimental SCL (terse)     Experimentally verified SCL (terse format)*
    Experimental SCL (verbose)   Experimentally verified SCL (verbose format)**
    GO Accession ID              Gene Ontology (GO) accession identifier
    GO Accession Definition      Gene Ontology (GO) accession definition
    PubMed ID reference          PubMed identifier of literature references
    Reference title              Title of literature reference books
    ISBN number reference        ISBN identifier of literature reference books
    WWW reference                Internet address of WWW references
    Reference comments           Comments re: literature references
    References summary           Concatenation of all reference fields (PubMed ID, title, ISBN number, WWW and comments)

    *strict SCL terminology as returned by PSORTb:
     Gram-negative: Cytoplasmic, CytoplasmicMembrane, Periplasmic, OuterMembrane
     and Extracellular
     Gram-positive: Cytoplasmic, CytoplasmicMembrane, Cellwall and Extracellular

    ** e.g. Cell wall surface exposed (LPxTG motif) protein

    # numerical field

     


     

    6. cPSORTdb fields

    6.1. cPSORTdb general fields

    The following table lists the general cPSORTdb fields which identify the protein and its source genome/organism. Default fields displayed in results pages are in bold. Numerical fields are followed by a #.

    Field name            Description
    Chromosome Acc ID     NCBI chromosome accession identifier associated with the protein (NC_00XXXX)
    GI #                  NCBI GI number of the protein (primary identification key, unique to each protein)
    RefSeq Accession ID   RefSeq accession identifier of the protein
    Protein name          The name of the protein
    Gene name             The name of the gene
    Alternate gene name   The alternative name of the gene
    Taxonomy ID #         NCBI taxonomy identifier of the source organism
    Organism              Genus and species of the source organism
    Phylum                Phylum of the source organism
    Class                 Class of the source organism
    Gram stain            Gram classification of the source organism
    Amino acid sequence   Amino acid sequence of the protein
    Sequence length #     Number of amino acids in the protein sequence
    ePSORTdb GI Link #    GI of the protein in the current ePSORTdb dataset*

    *: If the protein in cPSORTdb is identical at the sequence level and the species level to a protein in ePSORTdb, a link to the ePSORTdb record will appear here.

    # numerical field

     

     


    6.2. PSORTb v.1.1.2 & v.2.0 prediction fields

    The following table lists the cPSORTdb fields which contain information regarding computationally predicted SCLs. Default fields displayed in results pages are in bold.

    Field name                         Description
    SCL-BLAST localization             Predicted localization site (all possible SCL*)
    SCL-BLAST details                  Protein GI of ePSORTdb dataset
    Motif localization                 Predicted localization site (all possible SCL*)
    Motif details                      Motif accession number from PROSITE (list Gram neg. & pos.)
    OMPmotif localization (N)          Predicted localization site (either outer membrane or unknown)
    OMPmotif details (N)               OMPmotif accession number (list)
    HMMTOP localization (#)            Predicted localization site (all possible SCL*)
    HMMTOP helices count               Number of predicted helices
    Signal localization                Predicted localization site (all possible SCL*)
    Signal details                     Presence or absence of a signal peptide
    Profile localization (§)           Predicted localization site (all possible SCL*)
    Profile details (§)                Profile accession number from PROSITE (list Gram neg. & pos.)
    SCL-BLASTe localization (§)        Predicted localization site (all possible SCL*)
    SCL-BLASTe details (§)             Protein GI of ePSORTdb dataset

    CytoSVM localization (§)           Predicted localization site (either cytoplasmic or unknown)
    CMSVM localization (§)             Predicted localization site (either cytoplasmic membrane or unknown)
    PPSVM localization (N,§)           Predicted localization site (either periplasmic or unknown)
    CWSVM localization (P,§)           Predicted localization site (either cell wall or unknown)
    OMSVM localization (N,§)           Predicted localization site (either outer membrane or unknown)
    ECSVM localization (§)             Predicted localization site (either extracellular or unknown)
    SubLocC localization (N,§§)        Predicted localization site (either cytoplasmic or unknown)

    Cytoplasmic score (#)              Probability of cytoplasmic localization returned by PSORTb
    Cytoplasmic membrane score (#)     Probability of cytoplasmic membrane localization returned by PSORTb
    Periplasmic score (#,N)            Probability of periplasmic localization returned by PSORTb
    Cell wall score (#,P)              Probability of cell wall localization returned by PSORTb
    Outer membrane score (#,N)         Probability of outer membrane localization returned by PSORTb
    Extracellular score (#)            Probability of extracellular localization returned by PSORTb

    Predicted Localization **          Localization site returned by PSORTb
    GO Accession ID                    Gene Ontology (GO) accession identifier
    GO Accession Definition            Gene Ontology (GO) accession definition
    Predicted Localization Score (#)   Probability of PSORTb predicted localization

    * For Gram-negative: cytoplasmic, cytoplasmic membrane, periplasmic, outer membrane, extracellular;
      for Gram-positive: cytoplasmic, cytoplasmic membrane, cell wall, extracellular

    ** Localization site returned by PSORTb, specific to Gram-negative or Gram-positive.
    N: available only for Gram-negative
    P: available only for Gram-positive
    # numerical field
    § only available in PSORTb version 2.0
    §§ only available in PSORTb version 1.1.2

     


