Predikin Prediction Server

Frequently Asked Questions

These are questions that we've frequently had to answer, but if your question isn't answered here please contact us and we'll do our best to answer it.

We've categorised the questions into the following groups: General Questions, How it works, Using the website and Problems.

General Questions

What is Predikin?

Predikin is a computational method for predicting the substrates of serine/threonine protein kinases.

Who can use Predikin?

Predikin is freely available for academic, non-profit research. If your research is commercial, we are required to charge a small fee. Please contact us for more information.

How is Predikin 2.1 different to previous versions?

Predikin was originally implemented solely as a web application, written in JavaScript. Predikin 2.1 is a complete rewrite with the following features:
  • More reliable identification of substrate-determining residues
  • A scoring scheme based on substrate weight matrices
  • Ability to screen multiple substrates of your choice, even complete genomes (not available in the web version)
  • Additional methods for substrate prediction based on kinase classification
  • Some ability to predict tyrosine kinase substrates
  • A new cleaner web implementation (you're looking at it)

How do I cite Predikin?

Cite the latest version using:
  • Ellis J.J. and Kobe B. (2011) Predicting protein kinase specificity: Predikin update and performance in the DREAM4 challenge. PLoS One, 6(7):e21169. [Open Access][PubMed]
The original method is described in:
  • Brinkworth RI, Breinl RA and Kobe B. (2003). Structural basis and prediction of substrate specificity in protein serine/threonine kinases. PNAS USA 100(1):74-79.
Other Predikin publications:
  • Saunders, N.F.W., Brinkworth, R.I., Huber, T., Kemp, B.E. and Kobe, B. (2008). Predikin and PredikinDB: a computational framework for the prediction of protein kinase peptide specificity and an associated database of phosphorylation sites. BMC Bioinformatics 9:245. [Open Access][PubMed]
  • Saunders, N.F.W. and Kobe, B. (2008). The Predikin webserver: improved prediction of protein kinase peptide specificity using structural information. Nucleic Acids Res. [Open Access][PubMed]

How it works

How does Predikin decide which kinases have "similar" SDRs to my kinase?

Similarity is determined using a substitution matrix. A residue is assumed to be able to substitute for a substrate-determining residue (SDR) if its substitution score against that SDR is positive in the matrix you select. For instance, if SDR APE-9 is M (Met) in your kinase and you select BLOSUM62, Predikin will select kinases from the substrate database where APE-9 is M, I (Ile), L (Leu) or V (Val).
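
As a rough illustration of the rule above, the sketch below lists the residues that score positively against M in BLOSUM62. The function name is made up for this example and this is not Predikin's own code; only the BLOSUM62 row for methionine is hard-coded.

```python
# Row of the BLOSUM62 matrix for methionine (M) against the 20 standard residues.
BLOSUM62_M = {
    "A": -1, "R": -1, "N": -2, "D": -3, "C": -1,
    "Q": 0, "E": -2, "G": -3, "H": -2, "I": 1,
    "L": 2, "K": -1, "M": 5, "F": 0, "P": -2,
    "S": -1, "T": -1, "W": -1, "Y": -1, "V": 1,
}


def allowed_substitutions(matrix_row):
    """Return residues with a positive substitution score, sorted alphabetically.

    Hypothetical helper for illustration; a score of zero does not count.
    """
    return sorted(res for res, score in matrix_row.items() if score > 0)


print(allowed_substitutions(BLOSUM62_M))  # ['I', 'L', 'M', 'V']
```

Residues scoring exactly zero against M (F and Q) are excluded, which matches the M, I, L, V set described above.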

How does Predikin work?

This is a description of the classical Predikin method which uses substrate-determining residues (SDRs). When you submit a sequence to Predikin, the following steps occur:
  1. The sequence is analysed for protein kinase catalytic domains
  2. If found, domains are classified as serine-threonine, CMGC or tyrosine kinases
  3. If not a tyrosine kinase, substrate-determining residues (SDRs) in the domain are identified. These are residues in the catalytic domain that influence the amino acid frequency at positions -3 to +3 relative to the phosphorylated residue
  4. A database search identifies protein kinases with similar SDRs and, where possible, the phosphorylation sites at which they act
  5. The phosphorylation sites are used to build a frequency matrix, describing the predicted amino acid frequency at positions -3 to +3 for the query kinase
  6. The frequencies are converted to weights, which can be used to scan and score an input substrate sequence
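
Step 5 can be pictured with a small sketch. It assumes each phosphorylation site is given as a 7-residue string covering positions -3 to +3, with the phosphorylated residue at position 0; that layout, the function name and the example sites are all assumptions made for illustration, not Predikin's actual data model.

```python
from collections import Counter

# Assumed layout: 7 residues, phosphorylated residue in the middle (position 0).
POSITIONS = [-3, -2, -1, 0, 1, 2, 3]


def frequency_matrix(sites):
    """Count residues at each position across equal-length sites.

    Returns a dict mapping position (-3..+3) to a Counter of raw residue counts.
    """
    matrix = {pos: Counter() for pos in POSITIONS}
    for site in sites:
        if len(site) != len(POSITIONS):
            raise ValueError(f"expected 7 residues, got {len(site)}: {site!r}")
        for pos, residue in zip(POSITIONS, site):
            matrix[pos][residue] += 1
    return matrix


# Made-up sites, for illustration only.
example_sites = ["RRASVAG", "RKGSLEE", "KRPSIRA"]
fm = frequency_matrix(example_sites)
print(fm[-3])  # counts at position -3
print(fm[0])   # counts at the phosphorylated position
```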

What's the KSD method? How does it work?

KSD stands for Kinase Sequence Database. This resource uses hidden Markov models (HMMs) derived from multiple sequence alignments to classify protein kinases into families.

When you submit a kinase sequence to Predikin, it is classified into a KSD family. A database query then selects kinases of the same KSD family and their phosphorylation sites. The sites are used to build frequency and weight matrices that can be used to screen and score a substrate sequence. The idea here is that kinases of the same family tend to exhibit very similar substrate specificity.
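
The database query can be thought of along the following lines. This is only a sketch: the records, family labels and function name are hypothetical, and the real Predikin database is not a Python list.

```python
# Hypothetical records: (kinase ID, family label, known phosphorylation sites).
KINASE_RECORDS = [
    ("kinase_1", "family_A", ["RRASVAG", "RKGSLEE"]),
    ("kinase_2", "family_A", ["KRPSIRA"]),
    ("kinase_3", "family_B", ["PLSPTRP"]),
]


def sites_for_family(records, family):
    """Collect phosphorylation sites from every kinase in the given family."""
    return [site for _, fam, sites in records if fam == family for site in sites]


print(sites_for_family(KINASE_RECORDS, "family_A"))
# ['RRASVAG', 'RKGSLEE', 'KRPSIRA']
```

A family with no records yields an empty list, which corresponds to the case where no matrix can be built for the query kinase.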

The KSD method is a crude predictor, but can be useful in cases where the SDR method fails.


What's the PANTHER method? How does it work?

The PANTHER database clusters sequences from UniProt into families by sequence similarity.

When you submit a kinase to Predikin, it is classified into a PANTHER family using a program named pantherScore. Substrate frequency and weight matrices are then derived from a database query in a very similar way to the KSD method.

Like KSD, the PANTHER method is a crude predictor and is of little use in cases where your kinase's PANTHER family is under-represented in the database. However, there are cases where it works well when the SDR method fails.


How are weights calculated from frequencies?

Frequencies are converted to weights using the equation:

W(b,i) = log2( [ (F(b,i) + sqrt(N/20)) / (N + sqrt(N)) ] / p(b) )

where W is the weight of residue b at position i; F is the frequency (raw count) of residue b at position i; N is the number of sequences used (column sum); the sqrt(N/20) and sqrt(N) terms are pseudocounts; and p(b) is the frequency of residue b in all substrate sequences of the kinase type (Ser/Thr, CMGC or Tyr).

Note that for the SDR method, frequencies and weights are calculated independently for each row based on the SDRs for that substrate position. This can mean that N = 0 and hence weights cannot be calculated.

Note also that the highest weight for a row may not always correspond to the highest frequency for a row, depending on the value of p(b) for that residue.
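
The calculation can be sketched in a few lines. The sketch reads the equation as log2 of the pseudocount-corrected frequency divided by p(b), which is our interpretation of how the terms group; the function name and the background values in the example are made up for illustration.

```python
import math


def weight(count, n, background):
    """Weight for one residue at one position.

    count      -- raw count F(b,i)
    n          -- number of sequences used, N
    background -- p(b), background frequency of the residue
    Returns None when n == 0, because the ratio is then undefined.
    """
    if n == 0:
        return None
    corrected = (count + math.sqrt(n / 20)) / (n + math.sqrt(n))
    return math.log2(corrected / background)


# Illustrative values only: a common residue seen 6 times out of 10
# versus a rare residue seen 4 times out of 10.
w_common = weight(6, 10, 0.3)
w_rare = weight(4, 10, 0.01)
print(w_common < w_rare)      # True: the lower count gets the higher weight
print(weight(0, 0, 0.05))     # None: no sequences, so no weight
```

The comparison shows why the highest weight need not belong to the highest frequency: a rare residue with a small p(b) is rewarded more strongly than a common one.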


Using the website

Does Predikin work in all web browsers?

Predikin should work with any up-to-date browser, so long as JavaScript and cookies are enabled.

I was registered at the old Predikin website. Do I need to register again?

Users are no longer required to register to use Predikin, so no, you won't need to re-register.

What's FASTA format?

FASTA is a plain-text sequence format. Each record starts with a header line beginning with ">", followed by one or more lines of sequence.

Which prediction method is best: SDR, KSD or PANTHER?

  • The SDR method is the preferred method for serine-threonine and CMGC kinases. However, if the SDR method generates an incomplete weight matrix, try the KSD or PANTHER matrices instead.
  • Any scoring matrix will be unreliable if it is based on low frequencies, or frequencies with a lot of similar values in each row. This happens when the database contains too few substrates for a kinase of your type.
  • There will be rare cases where no Predikin method can generate a scoring matrix for your kinase sequence. Sorry, we can't help you with that.

What do the scores mean? What's a good score?

The score for a phosphorylation site is a relative score between 0 and 100. 0 is bad, 100 is good. A higher score means that the phosphorylation site has higher similarity to sites in the database with kinases similar to your kinase (similar SDRs or same KSD/PANTHER family).

Scores are also relative to each other. For a given kinase, a phosphorylation site is a "better" candidate if it scores higher than another site, from either the same or a different substrate sequence. For a given substrate, a kinase is more likely to phosphorylate a site if the score for the site is higher than that for a different kinase sequence.

A good way to view the scores is to export them to a spreadsheet and sort first by substrate ID, then by score. A substrate sequence often contains several sites with poorly separated scores plus one site that scores much higher than the rest; that site is likely to be genuine.

Some other factors to take into account:

  • In tests using the phosphoELM dataset, the lowest score observed for a known phosphorylation site using any method is around 50. We recommend 60 as a sensible default cutoff.
  • Almost 90% of known phosphorylation sites in phosphoELM are found in a disordered region as predicted by DisEMBL.
  • Only about 0.1% of known phosphorylation sites in phosphoELM are found in a TM helix as predicted by TMHMM.
  • You might also consider the subcellular localisation of kinase and substrate (if known) when making predictions.
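
If you prefer a script to a spreadsheet, the sorting and cutoff described above might look like the sketch below. The row layout, the IDs and the scores are invented for illustration; treating the cutoff of 60 as inclusive and sorting scores from highest to lowest are our choices here, not fixed Predikin behaviour.

```python
# Hypothetical exported rows: (substrate ID, site position, score).
rows = [
    ("sub_B", 45, 62.1),
    ("sub_A", 12, 88.4),
    ("sub_A", 130, 57.3),
    ("sub_B", 9, 91.0),
    ("sub_A", 77, 60.5),
]

CUTOFF = 60  # default cutoff suggested in the text


def rank_sites(rows, cutoff=CUTOFF):
    """Drop rows scoring below the cutoff, then sort by substrate ID
    and, within each substrate, by descending score."""
    kept = [row for row in rows if row[2] >= cutoff]
    return sorted(kept, key=lambda row: (row[0], -row[2]))


for substrate, position, score in rank_sites(rows):
    print(substrate, position, score)
```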

Why are the cells in my scoring matrix coloured red?

Red cells indicate that the frequency was zero for all residues in a row. This means that the row cannot be used in the weight matrix. This only happens using the SDR method, where frequencies/weights are calculated independently for each row based on SDRs.


Problems

My sequence is not accepted?

Make absolutely sure that the sequence is valid FASTA format. If this is the case and there's still a problem, let us know.
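
A quick sanity check you can run before submitting is sketched below. It is not the validation that Predikin itself performs; the function name is made up, and restricting sequences to the 20 standard amino acid letters is an assumption that is stricter than the FASTA format requires.

```python
# Assumption: only the 20 standard amino acid letters are accepted.
VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")


def check_fasta(text):
    """Return a list of problems found in a FASTA string (empty if none)."""
    problems = []
    lines = [line.strip() for line in text.strip().splitlines() if line.strip()]
    if not lines or not lines[0].startswith(">"):
        problems.append("first line must be a header starting with '>'")
    # Join every non-header line into one sequence string.
    sequence = "".join(line for line in lines if not line.startswith(">")).upper()
    if not sequence:
        problems.append("no sequence lines found")
    unexpected = sorted(set(sequence) - VALID_RESIDUES)
    if unexpected:
        problems.append("unexpected characters: " + "".join(unexpected))
    return problems


print(check_fasta(">my_kinase\nMKVLLA\nGHIKLM"))  # []
print(check_fasta("MKVLLA"))                      # missing header
```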

My sequence is processed and classified as a kinase, but no scoring matrices are generated?

In very rare cases, it's possible that your sequence is so dissimilar to those in our database that no known substrates are found and hence no matrices can be generated using SDRs or KSD/PANTHER family. Predikin is of no use in such a case.

Predikin says my sequence is not a kinase?

It's possible that your sequence resembles a protein kinase but either (1) the score when compared to a protein kinase HMM is below the threshold or (2) one or more critical residues are absent from the sequence. In either case, your sequence is unlikely to be a genuine protein kinase. If you're sure that it is, let us know and we'll take a look.

What do red rows in my frequency matrix mean?

This may happen when using the SDR method.

The SDR method works by locating SDRs in your sequence, searching the database for kinases with similar SDRs and building a frequency matrix row by row using substrate data for those kinases. In rare cases there may be no kinases with SDRs similar to yours, or no substrate data for kinases with those SDRs. The frequencies for that row will then total zero, so a weight cannot be calculated. In these cases Predikin will refuse to calculate weight matrices or make predictions.
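
The condition that turns a row red can be expressed as a simple check. The sketch assumes a frequency matrix stored as a mapping from substrate position to residue counts; that representation, the function name and the example values are illustrative only.

```python
def empty_rows(freq_matrix):
    """Return the positions whose frequencies total zero."""
    return [pos for pos, counts in freq_matrix.items() if sum(counts.values()) == 0]


# Hypothetical frequency matrix: position -> {residue: count}.
example = {
    -3: {"R": 4, "K": 2},
    -2: {},                # no substrate data found for this row
    -1: {"A": 0, "G": 0},  # residues listed, but every count is zero
    0: {"S": 6},
}

print(empty_rows(example))  # [-2, -1]
```

Any position returned by this check has no usable frequencies, so no weight can be calculated for it.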

© 2009-2011 University of Queensland