Frequently Asked Questions
These are questions that we've frequently had to answer, but if your question isn't answered here please contact us and we'll do our best to answer it.
We've categorised the questions into the following groups:
General Questions
- What is Predikin?
- Who can use Predikin?
- How is Predikin 2.1 different to previous versions?
- How do I cite Predikin?
What is Predikin?
Who can use Predikin?
How is Predikin 2.1 different to previous versions?
- More reliable identification of substrate-determining residues
- A scoring scheme based on substrate weight matrices
- Ability to screen multiple substrates of your choice, even complete genomes (not available in the web version)
- Additional methods for substrate prediction based on kinase classification
- Some ability to predict tyrosine kinase substrates
- A new cleaner web implementation (you're looking at it)
How do I cite Predikin?
- Ellis, J.J. and Kobe, B. (2011). Predicting protein kinase specificity: Predikin update and performance in the DREAM4 challenge. PLoS One 6(7):e21169. [Open Access][PubMed]
- Brinkworth, R.I., Breinl, R.A. and Kobe, B. (2003). Structural basis and prediction of substrate specificity in protein serine/threonine kinases. Proc. Natl. Acad. Sci. USA 100(1):74-79.
- Saunders, N.F.W., Brinkworth, R.I., Huber, T., Kemp, B.E. and Kobe, B. (2008). Predikin and PredikinDB: a computational framework for the prediction of protein kinase peptide specificity and an associated database of phosphorylation sites. BMC Bioinformatics 9:245. [Open Access][PubMed]
- Saunders, N.F.W. and Kobe, B. (2008). The Predikin webserver: improved prediction of protein kinase peptide specificity using structural information. Nucleic Acids Res. [Open Access][PubMed]
How it works
- How does Predikin decide which kinases have "similar" SDRs to my kinase?
- How does Predikin work?
- What's the KSD method? How does it work?
- What's the PANTHER method? How does it work?
- How are weights calculated from frequencies?
How does Predikin decide which kinases have "similar" SDRs to my kinase?
How does Predikin work?
- The sequence is analysed for protein kinase catalytic domains
- If found, domains are classified as serine-threonine, CMGC or tyrosine kinases
- If not a tyrosine kinase, substrate-determining residues (SDRs) in the domain are identified. These are residues in the catalytic domain that influence the amino acid frequency at positions -3 to +3 relative to the phosphorylated residue
- A database search identifies protein kinases with similar SDRs and, where possible, the phosphorylation sites at which they act
- The phosphorylation sites are used to build a frequency matrix, describing the predicted amino acid frequency at positions -3 to +3 for the query kinase
- The frequencies are converted to weights, which can be used to scan and score an input substrate sequence
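The frequency-matrix step can be sketched in Python. This is a minimal illustration under stated assumptions, not Predikin's own code: the function name `build_frequency_matrix`, the representation of each site as a 7-residue string centred on the phosphorylated residue, and the example sites are all hypothetical. It also uses one set of sites for every position, whereas the SDR method selects sites independently for each row.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
POSITIONS = [-3, -2, -1, 0, 1, 2, 3]

def build_frequency_matrix(sites):
    """Count residues at each position of aligned 7-residue sites.

    `sites` is a list of 7-character strings centred on the
    phosphorylated residue (string index 3 is position 0).
    Returns {position: Counter mapping residue -> raw count}.
    """
    matrix = {pos: Counter() for pos in POSITIONS}
    for site in sites:
        if len(site) != len(POSITIONS):
            raise ValueError(f"expected 7 residues, got {site!r}")
        for pos, residue in zip(POSITIONS, site):
            if residue in AMINO_ACIDS:  # skip gaps or unknown residues
                matrix[pos][residue] += 1
    return matrix

# Hypothetical sites, invented for illustration only.
example_sites = ["RRASVAG", "RKGSLLE", "KRPSQRA"]
freq = build_frequency_matrix(example_sites)
```

Each row of `freq` holds raw counts, which is the form of frequency the weight calculation expects.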
What's the KSD method? How does it work?
KSD stands for Kinase Sequence Database. This resource uses hidden Markov models (HMMs) derived from multiple sequence alignments to classify protein kinases into families.
When you submit a kinase sequence to Predikin, it is classified into a KSD family. A database query then selects kinases of the same KSD family and their phosphorylation sites. The sites are used to build frequency and weight matrices that can be used to screen and score a substrate sequence. The idea here is that kinases of the same family tend to exhibit very similar substrate specificity.
The KSD method is a crude predictor, but can be useful in cases where the SDR method fails.
What's the PANTHER method? How does it work?
The PANTHER database clusters sequences from UniProt into families by sequence similarity.
When you submit a kinase to Predikin, it is classified into a PANTHER family using a program named pantherScore. Substrate frequency and weight matrices are then derived from a database query in a very similar way to the KSD method.
Like KSD, the PANTHER method is a crude predictor and is of little use when your kinase's PANTHER family is under-represented in the database. However, there are cases where it works well when the SDR method fails.
How are weights calculated from frequencies?
Frequencies are converted to weights using the equation:
W(b,i) = log2( [ (F(b,i) + sqrt(N/20)) / (N + sqrt(N)) ] / p(b) )
where W is the weight of residue b at position i; F is the frequency (raw count) of residue b at position i; N is the number of sequences used (column sum); sqrt(N/20) and (N + sqrt(N)) are pseudocounts and p(b) is the frequency of residue b in all substrate sequences of the kinase type (Ser/Thr, CMGC or Tyr).
Note that for the SDR method, frequencies and weights are calculated independently for each row based on the SDRs for that substrate position. This can mean that N = 0 and hence weights cannot be calculated.
Note also that the highest weight for a row may not always correspond to the highest frequency for a row, depending on the value of p(b) for that residue.
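As a hedged sketch, the calculation can be written in Python as follows. It reads the equation as the base-2 logarithm of the pseudocount-adjusted frequency divided by p(b), and it uses the pseudocount term sqrt(N/20) exactly as written in the text. The function name `weight` and the background frequencies in the example are assumptions made for illustration, not values used by Predikin.

```python
import math

def weight(count, n, background):
    """Log-odds weight for one residue at one position.

    count      -- F(b,i), raw count of residue b at position i
    n          -- N, number of sequences contributing to the row
    background -- p(b), background frequency of residue b
    Returns None when N = 0, because the weight is then undefined.
    """
    if n == 0:
        return None
    # Pseudocount term as written in the text: sqrt(N/20).
    smoothed = (count + math.sqrt(n / 20)) / (n + math.sqrt(n))
    return math.log2(smoothed / background)

# Hypothetical row with N = 20: a common residue seen 6 times and a
# rare residue seen 4 times (background frequencies are invented).
w_common = weight(6, 20, 0.10)
w_rare = weight(4, 20, 0.01)
```

Here `w_rare` exceeds `w_common` even though the rare residue has the lower count, which illustrates why the highest weight in a row need not belong to the most frequent residue.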
Using the website
- Does Predikin work in all web browsers?
- I was registered at the old Predikin website. Do I need to register again?
- What's FASTA format?
- Which prediction method is best: SDR, KSD or PANTHER?
- What do the scores mean? What's a good score?
- Why are the cells in my scoring matrix coloured red?
Does Predikin work in all web browsers?
I was registered at the old Predikin website. Do I need to register again?
Which prediction method is best: SDR, KSD or PANTHER?
- The SDR method is the preferred method for serine-threonine and CMGC kinases. However, if the SDR method generates an incomplete weight matrix, try the KSD or PANTHER matrices instead.
- Any scoring matrix will be unreliable if it is based on low frequencies, or frequencies with a lot of similar values in each row. This happens when the database contains too few substrates for a kinase of your type.
- There will be rare cases where no Predikin method can generate a scoring matrix for your kinase sequence. Sorry, we can't help you with that.
What do the scores mean? What's a good score?
The score for a phosphorylation site is a relative score between 0 and 100, where 0 is bad and 100 is good. A higher score means that the site is more similar to the database sites phosphorylated by kinases similar to yours (similar SDRs or the same KSD/PANTHER family).
Scores are also relative to each other. For a given kinase, a phosphorylation site is a "better" candidate if it scores higher than another site, from either the same or a different substrate sequence. For a given substrate, a kinase is more likely to phosphorylate a site if the score for the site is higher than that for a different kinase sequence.
A good way to view the scores is to export them to a spreadsheet and sort first by substrate ID, then by score. A substrate sequence often contains several sites with poorly separated scores and one site that scores much higher than the rest; that site is likely to be genuine.
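The same view can be produced without a spreadsheet. The sketch below is illustrative only: the row layout (substrate ID, site position, score), the example values and the helper `top_margin` are assumptions, and sorting scores from highest to lowest is one reasonable choice rather than a requirement.

```python
# Hypothetical exported rows: (substrate ID, site position, score).
rows = [
    ("SUB2", 45, 71.2),
    ("SUB1", 120, 62.5),
    ("SUB1", 33, 93.1),
    ("SUB2", 210, 70.8),
    ("SUB1", 78, 64.0),
]

# Sort by substrate ID, then by score from highest to lowest.
ranked = sorted(rows, key=lambda r: (r[0], -r[2]))

def top_margin(scores):
    """Gap between the best and second-best score (None if fewer than two)."""
    ordered = sorted(scores, reverse=True)
    return ordered[0] - ordered[1] if len(ordered) > 1 else None

# Margin for each substrate: a large gap suggests one stand-out site.
margins = {
    sub: top_margin([s for (sid, _, s) in rows if sid == sub])
    for sub in {r[0] for r in rows}
}
```

In this invented data, SUB1 has one site well clear of the others, while the two SUB2 sites are poorly separated.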
Some other factors to take into account:
- In tests using the phosphoELM dataset, the lowest score observed for a known phosphorylation site using any method is around 50. We recommend 60 as a sensible default cutoff.
- Almost 90% of known phosphorylation sites in phosphoELM are found in a disordered region as predicted by DisEMBL.
- Only about 0.1% of known phosphorylation sites in phosphoELM are found in a transmembrane (TM) helix as predicted by TMHMM.
- You might also consider the subcellular localisation of kinase and substrate (if known) when making predictions.
Why are the cells in my scoring matrix coloured red?
Red cells indicate that the frequency was zero for all residues in a row. This means that the row cannot be used in the weight matrix. This only happens using the SDR method, where frequencies/weights are calculated independently for each row based on SDRs.
Problems
- My sequence is not accepted?
- My sequence is processed and classified as a kinase, but no scoring matrices are generated?
- Predikin says my sequence is not a kinase?
- What do red rows in my frequency matrix mean?
My sequence is not accepted?
My sequence is processed and classified as a kinase, but no scoring matrices are generated?
Predikin says my sequence is not a kinase?
What do red rows in my frequency matrix mean?
This may happen when using the SDR method.
The SDR method works by locating SDRs in your sequence, searching the database for kinases with similar SDRs and building a frequency matrix row by row using substrate data for those kinases. In rare cases there may be no kinases with SDRs similar to yours, or no substrate data for kinases with those SDRs. The frequencies for that row will then total zero, so a weight cannot be calculated. In these cases Predikin will refuse to calculate weight matrices or make predictions.
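The check described above amounts to looking for rows whose counts sum to zero. A minimal Python sketch, assuming a hypothetical matrix layout of position mapped to residue counts (the function name `empty_rows` and the example data are invented for illustration):

```python
def empty_rows(frequency_matrix):
    """Return the positions whose residue counts sum to zero."""
    return [pos for pos, counts in frequency_matrix.items()
            if sum(counts.values()) == 0]

# Hypothetical matrix: position -> {residue: raw count}.
matrix = {
    -1: {"R": 4, "K": 2},
    0: {"S": 5, "T": 1},
    1: {},  # no substrate data found for this row
}

missing = empty_rows(matrix)
# Weights can only be calculated when no row is empty.
can_score = not missing
```

With this example, position 1 is reported as empty, so no weight matrix would be built.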