NeuroPred: Input Documentation


A. Overview of Using NeuroPred

B. Sequence Submission

1. Sequence Format

2. Sequence Entry Errors

C. Model Selection

Available Models

D. Output Selection Tasks

E. Model and Mass Calculation Options

1. Simplified Options Interface

2. Advanced Options Interface

F. Selection of Post-Translational Modifications

G. References


A. Overview of Using NeuroPred

This document describes the inputs to NeuroPred. An earlier interface to NeuroPred (2006 NAR), which varies slightly from this version, was described with example usage by Southey et al. (2006b). A newer version of NeuroPred, Test 2009, is very similar to this version but can also use the artificial neural network models described in Tegge et al. (2007, 2008) and Southey et al. (2008).

The basic use of NeuroPred requires:

  1. Entering one or more sequences in the required format using one of two methods:
    1. The sequence submission box.
    2. A file selected using the file upload box.

  2. Optionally selecting:
    1. One or more models in the Model Selection box.
    2. One or more options from the Other Options box.
  3. Clicking on the submit button.

NeuroPred will attempt to predict cleavage sites in the submitted sequences and perform additional functions as determined by the different options selected. The output is explained in the Output Documentation.

Optionally, clicking on the button for the advanced options interface provides a range of enhanced options that control the input and output of NeuroPred. The simplified form can be accessed again by selecting the button for the simplified options interface. The simplified form is the advanced form with the main defaults automatically selected. Therefore, there is no difference between using the simplified form and the advanced form with the default settings.

Changing the default settings allows NeuroPred to use multiple models, calculate model accuracy statistics and obtain the mass of predicted peptides. The following sections explain the various settings of NeuroPred that can be changed.


B. Sequence Submission

A sequence is the only required input to NeuroPred, and sequence entry is the same in the simplified and advanced options interfaces. Sequences in the required format may be entered into the sequence submission box or uploaded from a text file; multiple sequences can be submitted at the same time. An uploaded sequence file takes priority over sequences entered in the sequence submission box; therefore, only sequences in the file will be used, even if sequences were also entered into the sequence submission box.


1. Sequence Format

There are two supported formats that are based on the FASTA format: the Basic Format and the Known Cleavage Format, which is an extension of the Basic Format that includes known cleavage information.

Basic Format: This format begins with a single-line description or header with a greater-than (">") symbol in the first column that is followed by lines of sequence data. The sequence can be provided in either upper or lower case on a new line following the header line. Spaces or other white space characters (such as tabs or new lines) can be included such that the sequence may span several lines but other non-sequence characters will result in an incorrect sequence. An example of the Basic Format for the Human proglucagon precursor sequence is:

                  >Human_proglucagon
                  MKSIYFVAGL FVMLVQGSWQ RSLQDTEEKS RSFSASQADP LSDPDQMNED 
                  KRHSQGTFTS DYSKYLDSRR AQDFVQWLMN TKRNRNNIAK RHDEFERHAE
                  GTFTSDVSSY LEGQAAKEFI AWLVKGRGRR DFPEEVAIVE ELGRRHADGS
                  FSDEMNTILD NLAARDFINW LIQTKITDR

Known Cleavage Format: An extension of the Basic Format that includes the known cleavage sites. This format is required in order to compute model accuracy statistics. The known cleavage sites must be entered immediately after the sequence on a new line. For every amino acid in the sequence a corresponding zero (0) or one (1) must be present where 0 and 1 denote non-cleavage and cleavage, respectively. All cleavages are assumed to occur C-terminal to the indicated amino acid. This binary sequence must be the same length as the sequence and, like the sequence, can include white space characters (such as spaces, tabs and new lines) and can span multiple lines.

                  >Human_proglucagon
                  MKSIYFVAGL FVMLVQGSWQ RSLQDTEEKS RSFSASQADP LSDPDQMNED
                  KRHSQGTFTS DYSKYLDSRR AQDFVQWLMN TKRNRNNIAK RHDEFERHAE
                  GTFTSDVSSY LEGQAAKEFI AWLVKGRGRR DFPEEVAIVE ELGRRHADGS
                  FSDEMNTILD NLAARDFINW LIQTKITDR
                  0000000000 0000000000 0000000000 0000000000 0000000000
                  0100000000 0000000000 0000000000 0010000000 1000001000
                  0000000000 0000000000 0000000001 0000000000 0000100000
                  0000000000 0000000000 000000000
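
The two formats can be read with a small parser. The following is only a sketch in Python, not NeuroPred's own code: the function name `parse_records` is hypothetical, and raising an error on malformed input is an assumption made here for clarity (NeuroPred itself ignores sequences in an incorrect format, as described in the next section).

```python
import re


def parse_records(text):
    """Split FASTA-like text into (header, sequence, cleavage) tuples.

    cleavage is None for the Basic Format, or a string of 0/1 characters
    for the Known Cleavage Format.
    """
    records = []
    header, body = None, []

    def flush():
        # Turn the lines collected for the current header into one record.
        if header is None:
            return
        # Remove all white space so the data may span several lines.
        data = re.sub(r"\s+", "", "".join(body))
        # Letters form the sequence; an optional trailing run of 0/1
        # is the known cleavage string.
        match = re.fullmatch(r"([A-Za-z]+)([01]*)", data)
        if match is None:
            raise ValueError(f"{header}: unexpected characters in sequence data")
        sequence, cleavage = match.group(1).upper(), match.group(2)
        if cleavage and len(cleavage) != len(sequence):
            raise ValueError(f"{header}: cleavage string and sequence differ in length")
        records.append((header, sequence, cleavage or None))

    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            flush()
            header, body = line[1:].strip(), []
        elif header is not None:
            body.append(line)
    flush()
    return records
```

Calling `parse_records` on text containing a Basic Format entry and a Known Cleavage Format entry returns one tuple per entry, with `None` in the third position when no binary string was supplied.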


2. Sequence Entry Errors

Sequence Entry Errors: NeuroPred will automatically ignore any sequences in incorrect formats and the error message, "There were either no sequences entered or sequences were entered in the wrong format" is displayed when there are no valid sequences submitted.

Known Cleavage Format: A warning message, "Warning: Cannot do model accuracy because either there is no data or something is wrong with the data format" is displayed if at least one sequence has an incorrect format. In this situation, NeuroPred will only predict cleavage sites and display the associated output. An additional warning message is displayed when there is a mismatch between the sequence and the binary input.

Other possible sequence input errors that may cause unexpected results include sequences that do not start on a new line immediately after the header, the header line spanning more than one line, or non-standard characters.


C. Model Selection

The model selection options are the same in both interfaces; however, the advanced options interface provides a wider range of models. All models except the Known Motif were trained on specific organisms or related species. Generally the most correct predictions are obtained when the model is trained either on the same or related species. However, because no single prediction model is ideal, the simultaneous use of complementary methods is allowed and encouraged. The Model Accuracy Statistics from the Output Selection Task can be used to assess the performance of different models to assist in selecting the appropriate models.


Available Models

The following four cleavage prediction models are available in both interfaces. The first model is an empirical model, whereas the other three are binary logistic regression models:

Known Motif:
This model, reported by Southey et al. (2006a), is based on the occurrence of the following motifs in the sequence: xxKR, xxKK, xxRR, RxxR and RxxK, where x is any amino acid and cleavage occurs after the right-most or C-terminal amino acid. When any of these motifs is present, the cleavage probability is set to 0.88 and when two of these motifs are present (e.g. RxKR, which is composed of the motifs xxKR and RxxR), the cleavage probability is set to 0.997. This is the default model.
Mollusc:
The Complex Model, reported by Hummon et al. (2003), trained on precursor sequences from the mollusc, Aplysia californica.
Mammalian:
Model reported by Amare et al. (2006), trained on published mammalian precursor sequences from cow, human, mouse, pig and rat.
Insect:
A joint model derived from the models trained by Southey et al. (2008) on published precursor sequences from the honey bee, Apis mellifera and fruit fly, Drosophila melanogaster.
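
The Known Motif rule is simple enough to sketch directly. The Python below is an illustration rather than NeuroPred's implementation. The function name is hypothetical, and two behaviours are assumptions not stated above: a site that matches no motif is given a probability of 0.0, and a site with fewer than three preceding residues is not scored.

```python
import re

# The five motifs, written as regular expressions over a four-residue
# window whose last residue is the candidate cleavage site.
MOTIFS = ("..KR", "..KK", "..RR", "R..R", "R..K")


def known_motif_probability(sequence, site):
    """Return the Known Motif cleavage probability at a 0-based site.

    Cleavage would occur C-terminal to sequence[site].
    """
    if site < 3:
        # Assumption: a full four-residue window is required.
        return 0.0
    window = sequence[site - 3 : site + 1].upper()
    hits = sum(1 for motif in MOTIFS if re.fullmatch(motif, window))
    if hits >= 2:
        return 0.997
    if hits == 1:
        return 0.88
    # Assumption: no motif means no predicted cleavage.
    return 0.0
```

For example, the window DSRR matches only xxRR, whereas RAKR matches both xxKR and RxxR and therefore receives the higher probability.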

The following binary logistic regression models are only available with the advanced option interface:

Mollusc_Basic:
Model reported by Hummon et al. (2003), trained on precursor sequences from the mollusc, Aplysia californica. This model has fewer terms than the Mollusc model and had poorer predictive ability than the Mollusc model (Hummon et al. 2003).
Apis:
Model reported by Southey et al. (2008), trained on published precursor sequences from the honey bee, Apis mellifera, reported by Hummon et al. (2006).
Drosophila:
Model reported by Southey et al. (2008), trained on precursor sequences from the fruit fly, Drosophila melanogaster.
Bovine:
A model trained on published cattle, Bos taurus, precursor sequences by Tegge et al. (2007, 2008).
Human:
A model trained on published human, Homo sapiens, precursor sequences by Tegge et al. (2007, 2008).
Mouse:
A model trained on published mouse, Mus musculus, precursor sequences by Tegge et al. (2007, 2008).
Rat:
A model trained on published rat, Rattus norvegicus, precursor sequences by Tegge et al. (2007, 2008).
Any Basic Site:
Simply considers any basic amino acid (Arginine or Lysine) as cleaved. However, when the amino acid combinations listed in the known motif model (e.g. xxKR, xxKK, xxRR, RxxR and RxxK) are present, only the last amino acid in the motif is considered as cleaved.

The newer version of NeuroPred, Test 2009, adds the following models to the advanced option interface. These include models that use amino acid physicochemical properties and models trained using artificial neural networks. The insect models were developed by Southey et al. (2008) and the human, mouse, bovine and rat models were developed by Tegge et al. (2007, 2008):

Bovine amino acids plus properties Log Reg:
Logistic regression model trained on bovine sequences that uses both amino acids and amino acid physicochemical properties.
Human amino acids plus properties Log Reg:
Logistic regression model trained on human sequences that uses both amino acids and amino acid physicochemical properties.
Mouse amino acids plus properties Log Reg:
Logistic regression model trained on mouse sequences that uses both amino acids and amino acid physicochemical properties.
Rat amino acids plus properties Log Reg:
Logistic regression model trained on rat sequences that uses both amino acids and amino acid physicochemical properties.
Apis Neural Net:
Artificial neural network model trained on Apis sequences that uses only amino acids.
Drosophila Neural Net:
Artificial neural network model trained on Drosophila sequences that uses only amino acids.
Bovine amino acids only Neural Net:
Artificial neural network model trained on bovine sequences that uses only amino acids.
Human amino acids only Neural Net:
Artificial neural network model trained on human sequences that uses only amino acids.
Mouse amino acids only Neural Net:
Artificial neural network model trained on mouse sequences that uses only amino acids.
Rat amino acids only Neural Net:
Artificial neural network model trained on rat sequences that uses only amino acids.
Bovine amino acids plus properties Neural Net:
Artificial neural network model trained on bovine sequences that uses both amino acids and amino acid physicochemical properties.
Human amino acids plus properties Neural Net:
Artificial neural network model trained on human sequences that uses both amino acids and amino acid physicochemical properties.
Mouse amino acids plus properties Neural Net:
Artificial neural network model trained on mouse sequences that uses both amino acids and amino acid physicochemical properties.
Rat amino acids plus properties Neural Net:
Artificial neural network model trained on rat sequences that uses both amino acids and amino acid physicochemical properties.


Differences between Models

The binary logistic regression models vary in both the number of locations surrounding the cleavage site and the specific amino acids at these locations. Details on the specific composition of the models as well as description of training data and performance of each model can be found in the respective references. The general differences between the models are summarized in the following table.

    Summary of Prediction Models

    Model           Window   Location of     Number of Amino Acid-Location   Example of Window
                    Length   Cleavage Site   Combinations Used
    Known Motif     4        4               4                               DSRR
    Mollusc         13       5               10                              LDSRRAQDFVQWLM
    Mammalian       18       9               11                              YSKYLDSRRAQDFVQWLM
    Insect          15       6               20                              YLDSRRAQDFVQWLM
    Mollusc_Basic   13       5               24                              LDSRRAQDFVQWLM
    Apis            7        4               9                               DSRRAQD
    Drosophila      15       6               20                              YLDSRRAQDFVQWLM
    Bovine          18       9               67                              YSKYLDSRRAQDFVQWLM
    Human           18       9               34                              YSKYLDSRRAQDFVQWLM
    Mouse           18       9               101                             YSKYLDSRRAQDFVQWLM
    Rat             18       9               71                              YSKYLDSRRAQDFVQWLM

    Description of Summary of Prediction Models Columns:

    Window Length:
    Number of locations surrounding the site used by the model to predict cleavage. This is typically smaller than the number of locations used in training the model.
    Location of Cleavage Site:
    The location of the cleaved site in the window from the left-hand or N-terminal end of the window.
    Number of Amino Acid-Location Combinations Used:
    This is the maximum number of possible combinations of amino acids and locations within the window that are used by the model. A model can have many amino acids at the same location.
    Example of Window:
    An example of a possible window for the same sample sequence. The cleavage site is the amino acid at the position given in the Location of Cleavage Site column, and actual cleavage occurs to the right of, or C-terminal to, this amino acid.
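
The relationship between the Window Length and Location of Cleavage Site columns can be sketched in Python as follows. This is an illustration only: the dictionary holds a few rows copied from the summary table, the function name is hypothetical, and returning `None` when the window would run past either end of the sequence is an assumption, since the handling of sites near the termini is not described here.

```python
# (window length, location of cleavage site) for a few rows of the table.
MODEL_WINDOWS = {
    "Known Motif": (4, 4),
    "Mammalian": (18, 9),
    "Insect": (15, 6),
    "Apis": (7, 4),
}


def extract_window(sequence, site, model):
    """Return the window a model would see around a cleavage site.

    site is the 1-based position of the residue after which cleavage
    would occur. Returns None if the window does not fit in the sequence.
    """
    length, location = MODEL_WINDOWS[model]
    start = site - location      # 0-based index of the first window residue
    end = start + length
    if start < 0 or end > len(sequence):
        return None
    return sequence[start:end]
```

With the fragment DYSKYLDSRRAQDFVQWLMN from the human proglucagon example and the site at position 10 (the second R of DSRR), the function reproduces the windows listed in the table for these models.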

D. Output Selection Tasks

There are four Output Selection Tasks available:

Output Selection Tasks

Description of Output Selection Tasks:

Only Predict Cleavage Sites:
Predicts cleavage sites for submitted sequences using the selected cleavage prediction models. This is the default task.
Obtain Mass of Predicted Peptides:
Predicts cleavage sites for submitted sequences using the selected cleavage prediction models and calculates the average and monoisotopic molecular masses of predicted peptides resulting from cleavage and any selected post-translational modifications.
Model Accuracy Statistics:
Predicts cleavage sites for submitted sequences and calculates model accuracy statistics for all submitted sequences and selected cleavage prediction models. In order to obtain model accuracy statistics, the known cleavage format is required. A warning message, "Warning: Cannot do model accuracy because either there is no data or something is wrong with the data format" is provided if at least one sequence has an incorrect format and model accuracy statistics are not computed for any sequence. See Output Documentation for further details.
Print Probabilities of Basic Sites Only:
Only prints the predicted probabilities of the basic sites within the submitted sequences.

E. Model and Mass Calculation Options

A variety of options are available to control the input and output of NeuroPred depending on whether the simplified or advanced option interface is used.

1. Simplified Options Interface

Other Options
Display Cleavage Probabilities? (select for yes)
Input the length of the signal peptide (use zero (0) for no signal peptide)
Sort the output from mass calculations on
Remove any C-terminal K and R from predicted peptides? (select for yes)
Select Post-Translational Modifications (PTMs)

Description of Other Options:
Display Cleavage Probabilities?
This option permits the display of the actual probabilities calculated for the basic amino acids in the sequence. This option is unselected by default so the probabilities are not displayed unless this box is selected.
Input the length of the signal peptide:
This option provides the default length of the signal peptide when a Signal length identifier is not used. A default value of 15 is used as determined by the smallest known example in a wide range of known precursors. Use zero when there is no signal peptide present such as for sequence fragments.
Sort the output from mass calculations on:
This option is only valid when the Obtain Mass of Predicted Peptides task is selected and, therefore, it is ignored with all other tasks. By default the predicted peptides and associated masses are not sorted, but this option permits the output to be sorted by different criteria. The predicted peptides can be sorted from low to high by Average mass or Monoisotopic mass, or by Sequence, which orders the predicted peptides by their location in the original input sequence and any post-translational modifications applied to each peptide.
Remove any C-terminal K and R from predicted peptides?
This option is only valid when the Obtain Mass of Predicted Peptides task is selected and, therefore, it is ignored with all other tasks. This option permits the removal of C-terminal basic amino acids (Arginine and Lysine) from the cleaved sequence and is selected by default.
Select Post-Translational Modifications (PTMs):
This option is only valid when the Obtain Mass of Predicted Peptides task is selected and, therefore, it is ignored with all other tasks. This option permits the selection of different groups of post-translational modifications (See Section F. Selection of Post-Translational Modifications for more details):
No PTMs selected:
No post-translational modifications are applied. This is the default selection.
Common PTMs:
Common post-translational modifications (Amidation, Pyroglutamylation, Acetylation and Tyr-Sulfation) are applied.
All PTMs:
All post-translational modifications listed under the Advanced Options Interface are applied.

2. Advanced Options Interface

Modeling and Mass Calculation Options
Threshold cleavage probability (0-1.0): Confidence interval significance level (0.005-0.20):
Number of amino acids surrounding cleavage site: Before After
Number of amino acids in the signal sequence: Signal length identifier:
Lower mass value (Da): Upper mass value (Da):
Maximum number of amino acids in a peptide: Degree of peptide extension:
Sort the output from mass calculations on    
Ignore processing rules: Use basic sites for accuracy statistics?

Description of Modeling and Mass Calculation Options:

Threshold cleavage probability:
The minimum value that the predicted cleavage probability must exceed to declare the associated site as being cleaved. The default threshold cleavage probability is 0.50 and can range from 0 to 1.0. A lower threshold increases the number of true positives at the cost of more false positives, while a higher threshold increases the number of true negatives at the cost of more false negatives.
Confidence interval significance level:
The significance level, α, required to generate a 100*(1-α)% confidence interval. The default α value is 0.05 and it can range from 0.005 to 0.2. The α value of 0.05 means that the true cleavage probability lies within the confidence interval range in 95% of the samples. The confidence interval range can be used as an indicator of the uncertainty associated with cleavage prediction where a narrow interval range implies reliable prediction and a wider range implies a less reliable prediction.
Number of amino acids surrounding cleavage site:
Specifies the minimum number of amino acids that must surround the cleavage site for cleavage to occur and is currently used to avoid predicting cleavage near the sequence termini. The Before option denotes the number of amino acids N-terminal to the cleavage site and the After option denotes the number of amino acids C-terminal to the cleavage site. The default values are based on empirical observation of known cleavages and information from the crystal structure of furin.
Signal peptide length:
Located at the N-terminus of a precursor, the signal peptide is a transport marker needed for the translocation of the precursor into the lumen of the endoplasmic reticulum and is co-translationally removed by a signal peptidase. Consequently, this region must be removed prior to prediction of cleavage sites. The predicted length of the signal peptide for any sequence can be obtained using the SignalP application. There are two ways to account for the signal peptide:
Number of amino acids in the signal sequence:
This option specifies a global length of the signal peptide for all submitted sequences so that no prediction occurs in this region. The default value is 15 but can range from 0 to the length of the sequence. If an invalid value is found, NeuroPred defaults to signal peptide length of 15.
Signal length identifier:
An individual length for each sequence can be indicated in the header of the FASTA format by the "Signal length identifier". The default signal length identifier is SignalP= and can be changed to any value. A valid value given with this identifier overrides the default global length of the signal peptide; thus, this specification is useful when multiple sequences have varying signal peptide lengths. The signal length identifier is case insensitive, such that SIGnalP is treated the same as signalp. The number of amino acids in the signal sequence must immediately follow this identifier and a space must occur after the number of amino acids. For example, "SignalP=25 aa" is a valid specification but "SignalP=25aa" is invalid. If an invalid value is found, NeuroPred uses the default value of 15.
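
The header rule can be illustrated with a short Python sketch. It is not NeuroPred's parser: the function name and parameters are hypothetical, and treating a number at the very end of the header (with no following space) as invalid is an assumption that follows the rule above literally.

```python
import re

DEFAULT_SIGNAL_LENGTH = 15


def signal_length(header, identifier="SignalP=", default=DEFAULT_SIGNAL_LENGTH):
    """Read a per-sequence signal peptide length from a FASTA header.

    The identifier is matched case-insensitively, the digits must follow
    it immediately, and white space must follow the digits. Anything else
    falls back to the default length.
    """
    pattern = re.escape(identifier) + r"(\d+)\s"
    match = re.search(pattern, header, flags=re.IGNORECASE)
    if match is None:
        return default
    return int(match.group(1))
```

A header such as `>Human_proglucagon SignalP=25 aa` yields 25, whereas `>Human_proglucagon SignalP=25aa` falls back to the default of 15.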
Lower mass value:
Predicted peptides with masses lower than this value are not reported. The default value is 0 Da.
Upper mass value:
Predicted peptides with masses higher than this value are not reported. The default value is 15,000 Da.
Maximum number of amino acids in a peptide:
Does not output cleaved peptides that have more amino acids than this value, thus limiting the number of reported final products. The default is set to 1000 amino acids.
Degree of peptide extension:
In order to account for false positive (cleaved) predictions, the predicted peptide can be extended to include as many adjacent peptides as specified in the "Degree of peptide extension" box. This option greatly increases the number of peptides in the output but can be useful for peptidomics studies. The figure below illustrates how extending the peptide increases the combinations of predicted peptides. The degree of peptide extension can range from 0 to any integer. As shown in the example, a zero degree of extension produces the shortest possible peptides whereas higher degrees of extension produce longer peptides. When the degree of extension exceeds or is equal to the total number of predicted cleavage sites, all possible peptides except the original sequence are produced. Consequently, in the example shown below for an extension value of three or higher, all extensions up to but not including the fourth extension are provided in the output.

Peptide extension increases the number of predicted peptides

Example of different peptide extension
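
The effect of the degree of peptide extension can also be sketched in code. The Python below is a simplified illustration with hypothetical function names. It ignores the signal peptide, the trimming of C-terminal basic residues and the mass limits, and it assumes that a sequence with no predicted cleavage sites is reported as a single peptide.

```python
def split_at_sites(sequence, sites):
    """Cut the sequence C-terminal to each 1-based position in sites."""
    pieces, start = [], 0
    for site in sorted(set(sites)):
        pieces.append(sequence[start:site])
        start = site
    pieces.append(sequence[start:])
    return [piece for piece in pieces if piece]


def extended_peptides(sequence, sites, degree):
    """List peptides made of up to degree + 1 adjacent cleaved pieces."""
    pieces = split_at_sites(sequence, sites)
    if len(pieces) == 1:
        # Assumption: with no cleavage the whole sequence is the peptide.
        return pieces
    # Joining every piece would give back the original sequence,
    # which is excluded from the output.
    max_extra = min(degree, len(pieces) - 2)
    peptides = []
    for extra in range(max_extra + 1):
        for i in range(len(pieces) - extra):
            peptides.append("".join(pieces[i : i + extra + 1]))
    return peptides
```

With four cleavage sites there are five shortest peptides at degree 0. Degree 3 or higher adds the four, three and two peptides made of two, three and four adjacent pieces, for 14 peptides in total, and never the uncleaved sequence itself.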

Sort the output from mass calculations on:

This option is only valid when the Obtain Mass of Predicted Peptides task is selected and, therefore, it is ignored with all other tasks. By default the predicted peptides and associated masses are not sorted, but this option permits the output to be sorted by different criteria. The predicted peptides can be sorted from low to high by Average mass or Monoisotopic mass, or by Sequence, which orders the predicted peptides by their location in the original input sequence and any post-translational modifications applied to each peptide.
Ignore processing rules:
This option determines whether or not the processing rules described by Amare et al. (2006) and Southey et al. (2008) are implemented. By default these rules are followed such that cleavage sites are ignored when cleavage could occur at a nearby location (specifically where a basic amino acid was present in the P1 location, a basic amino acid was present in either the P1' or P4' locations, and there was no basic amino acid located in either the P2 or P4 locations, where "P" and "P'" denote N-terminal and C-terminal locations relative to the cleavage site, respectively, such that cleavage occurs between the P1 and P1' locations). The ignored sites are termed redundant sites and are denoted as 'r' in the Cleavage Prediction diagram in the output of NeuroPred.

Use basic sites for accuracy statistics?:
This option is only valid when the Model Accuracy Statistics task is selected and, therefore, it is ignored with all other tasks. By default the model accuracy statistics are calculated only with basic amino acids. Selecting No to this option permits the model accuracy to be calculated over the complete sequence including all non-basic sites that are usually considered uncleaved.
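
To make the interaction between the threshold cleavage probability and the basic-sites option concrete, the following Python sketch counts true and false positives and negatives from a known cleavage string and a list of predicted probabilities. The function name and return format are hypothetical, and the sketch only produces raw counts; the statistics that NeuroPred actually reports are described in the Output Documentation.

```python
def confusion_counts(sequence, known, probabilities, threshold=0.5, basic_only=True):
    """Count TP, FP, TN and FN over the sites of one sequence.

    known is the 0/1 string from the Known Cleavage Format and
    probabilities holds one predicted cleavage probability per residue.
    """
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for residue, truth, probability in zip(sequence, known, probabilities):
        if basic_only and residue.upper() not in "KR":
            continue  # skip non-basic sites when only basic sites are scored
        predicted = probability > threshold  # must exceed the threshold
        actual = truth == "1"
        key = ("T" if predicted == actual else "F") + ("P" if predicted else "N")
        counts[key] += 1
    return counts
```

Setting `basic_only=False` corresponds to scoring every site in the sequence, which adds the non-basic sites (usually uncleaved) to the counts.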

F. Selection of Post-Translational Modifications

Prohormone precursors undergo extensive modification before active neuropeptides and hormones are obtained. In addition to cleavages at basic sites and immediate removal of C-terminal basic residues (Trim C-terminal K and R), several other modifications may be present. The most common PTMs are amidation (of C-terminal glycine) and pyroglutamylation (cyclization of N-terminal glutamate or glutamine). Sulfation of tyrosine (Tyr-Sulfation) and acetylation are also common, albeit somewhat less frequent; thus these four common PTMs are grouped together while even less common PTMs are grouped separately. Disulfide bond formation between two cysteine residues, resulting in a mass loss of 2 Da, is a common PTM in neuropeptides, but it is difficult to predict whether the disulfide bond is formed between two peptides or within a peptide containing two or more cysteines. In addition, it is difficult to identify the cysteine pairs involved in bond formation. For these reasons, disulfide bond formation is not modeled in NeuroPred. The user may consult web-based tools such as Cyspred to determine potentially disulfide-bonding cysteines. The available PTMs are listed in the table below. The advanced option interface permits the selection (or deselection) of individual PTMs, where Amidation and Pyroglutamylation are selected by default.

Post-Translational Modifications (PTMs)

Trim C-terminal K and R

Most Common PTMs     Less Common PTMs
Amidation            O-linked Glycosylation of S    N-linked Glycosylation of S    Bromination of W
Pyroglutamination    O-linked Glycosylation of T    N-linked Glycosylation of T    Methylation of E
Acetylation          Dipeptidase                    Hydroxylation of P             Methylation of H
Sulfation of Y       Carboxylation of E             Phosphorylation of S           Methylation of K
                     DiAcetylation                  Phosphorylation of T           Methylation of R


G. References

Amare, A., Hummon, A.B., Southey, B.R., Zimmerman, T.A., Rodriguez-Zas, S.L., Sweedler, J.V., Bridging neuropeptidomics and genomics with bioinformatics: prediction of mammalian neuropeptide prohormone processing. J. Proteome Res. 2006, 5, 1162-1167.

Hummon, A.B., Hummon, N.P., Corbin, R.W., Li, L.J., Vilim, F.S., Weiss, K.R., Sweedler, J.V., From precursor to final peptides: a statistical sequence-based approach to predicting prohormone processing. J. Proteome Res. 2003, 2, 650-656.

Hummon, A.B., Richmond, T.A., Verleyen, P., Baggerman, G., Huybrechts, J., Ewing, M.A., Vierstraete, E., Rodriguez-Zas, S.L., Schoofs, L., Robinson, G.E., Sweedler, J.V., From the Genome to the Proteome: Uncovering Peptides in the Apis Brain. Science 2006, 314, 647-649.

Southey, B.R., Rodriguez-Zas, S.L., Sweedler, J.V., Prediction of neuropeptide prohormone cleavages with application to RFamides. Peptides 2006a, 27, 1087-1098.

Southey, B.R., Amare, A., Zimmerman, T.A., Rodriguez-Zas, S.L., Sweedler, J.V., NeuroPred: a tool to predict cleavage sites in neuropeptide precursors and provide the masses of the resulting peptides. Nucleic Acids Res. 2006b, 34 (Web Server issue), W267-272.

Tegge, A.N., Southey, B.R., Sweedler, J.V., Rodriguez-Zas, S.L., Enhanced Prediction of Cleavage in Bovine Precursor Sequences. Lecture Notes in Computer Science, Bioinformatics Research and Applications, Vol. 4463, pp. 350-360, 2007, Springer.

Southey, B.R., Hummon, A.B., Richmond, T.A., Sweedler, J.V., Rodriguez-Zas, S.L., Prediction of neuropeptide cleavage sites in insects. Bioinformatics 2008, 24, 815-825.

Tegge, A.N., Southey, B.R., Sweedler, J.V., Rodriguez-Zas, S.L., Comparative Analysis of Neuropeptide Cleavage Sites in Human, Mouse, Rat, and Cattle. Mamm. Genome 2008, 19(2), 106-120.
