NNPP is a method that finds eukaryotic and prokaryotic promoters in a DNA sequence. The function of the promoter as an initiator of transcription is one of the most complex processes in molecular biology. It has been shown that multiple functional sites in the primary DNA sequence are involved in the polymerase binding process. These elements, such as the TATA-box and the transcription start site ("Initiator") in eukaryotes, are known to function as binding sites for RNA Polymerase II, transcription factors, and other proteins involved in transcription initiation. The promoter elements occur in various combinations, separated by various distances in the sequence.
The basis of the NNPP program is a time-delay neural network (see Further References for details). The time-delay network consists mainly of two feature layers, one for recognizing the TATA-box and one for recognizing the "Initiator", which is the region spanning the transcription start site. Both feature layers are combined into one output unit, which gives output scores between 0 and 1. The neural network method is described in detail in:
(1) Reese, M.G. (1994) Diploma Thesis, German Cancer Research Center, Heidelberg.
(2) Reese, M.G. and Eeckman, F.H. (1995) "Novel Neural Network Algorithms for Improved Eukaryotic Promoter Site Recognition". The Seventh International Genome Sequencing and Analysis Conference, Hilton Head Island, South Carolina.
(3) Reese, M.G., Harris, N.L. and Eeckman, F.H. (1996) "Large Scale Sequencing Specific Neural Networks for Promoter and Splice Site Recognition". Biocomputing: Proceedings of the 1996 Pacific Symposium, edited by Lawrence Hunter and Terri E. Klein, World Scientific Publishing Co, Singapore, January 2-7, 1996.
Please cite these when quoting NNPP output.
A careful 4-fold cross validation test on 429 eukaryotic RNA Polymerase II promoters from the Eukaryotic Promoter Database (EPD, version 50)
Bucher,P. & Trifonov,E.N. (1986). Compilation and analysis of eukaryotic POL II promoter sequences. Nucl. Acids Res. 14, 10009-10026.
Bucher, P. (1989). Weight Matrix Description of Four Eukaryotic RNA Polymerase II Promotor Elements Derived from 502 Unrelated Promotor Sequences. J. Mol. Biol. 212, 563-578.
and on 305 unrelated genes with less than 50% pairwise sequence identity (gene data set) gave the following results (results averaged over both test sets):
+-----------+-------------+-------------+-------------+
| threshold | % promoters | % false     | correlation |
|           | recognized  | positives   | coefficient |
|           |             |             | (CC)        |
+-----------+-------------+-------------+-------------+
| 0.99      | 10%         | 0.0%        | 0.38        |
| 0.97      | 20%         | 0.0-0.1%    | 0.38        |
| 0.92      | 30%         | 0.1-0.3%    | 0.50        |
| 0.85      | 40%         | 0.1-0.4%    | 0.60        |
| 0.70      | 50%         | 0.8-1.0%    | 0.65        |
| 0.38      | 60%         | 1.0-3.1%    | 0.61        |
| 0.20      | 70%         | 2.2-5.3%    | 0.58        |
| 0.12      | 80%         | 5.1-12.5%   | 0.52        |
+-----------+-------------+-------------+-------------+
These percentages are defined by:

promoters recognized = TP / (TP + FN)
    (observed promoters predicted as promoters / all observed promoters)

false positives = FP / (FP + TN)
    (observed non-promoters predicted as promoters / all observed non-promoters)

correlation coefficient (CC) = (TP x TN - FN x FP) / sqrt((TP + FN) x (TN + FP) x (TP + FP) x (TN + FN))

where
TP = true positives = observed promoters predicted as promoters
TN = true negatives = observed non-promoters predicted as non-promoters
FP = false positives = observed non-promoters predicted as promoters
FN = false negatives = observed promoters predicted as non-promoters
A careful cross validated test on 272 prokaryotic E. coli promoters collected and described in
gave the following results:
+-----------+-------------+-------------+-------------+
| threshold | % promoters | % false     | correlation |
|           | recognized  | positives   | coefficient |
|           |             |             | (CC)        |
+-----------+-------------+-------------+-------------+
| 0.9       | 50%         | 0.3%        | 0.71        |
| 0.8       | 60%         | 0.4%        | 0.72        |
| 0.65      | 70%         | 0.9%        | 0.73        |
| 0.55      | 75%         | 1.3%        | 0.72        |
| 0.35      | 80%         | 1.7%        | 0.72        |
| 0.15      | 90%         | 2.7%        | 0.70        |
| 0.03      | 95%         | 4.7%        | 0.63        |
+-----------+-------------+-------------+-------------+
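One way to use the table above is to pick the score threshold that recognizes the most promoters while staying under an acceptable false-positive rate. The sketch below, in Python, copies the rows of the E. coli table; the helper pick_threshold and the 1% limit are hypothetical and are not part of the NNPP program.

```python
# Rows copied from the E. coli table above:
# (threshold, % promoters recognized, % false positives, CC)
ECOLI_ROWS = [
    (0.90, 50, 0.3, 0.71),
    (0.80, 60, 0.4, 0.72),
    (0.65, 70, 0.9, 0.73),
    (0.55, 75, 1.3, 0.72),
    (0.35, 80, 1.7, 0.72),
    (0.15, 90, 2.7, 0.70),
    (0.03, 95, 4.7, 0.63),
]


def pick_threshold(rows, max_false_positive_pct):
    """Return the row with the highest recognition rate whose
    false-positive percentage does not exceed the given limit,
    or None if no row qualifies."""
    allowed = [row for row in rows if row[2] <= max_false_positive_pct]
    if not allowed:
        return None
    return max(allowed, key=lambda row: row[1])


# Example: accept at most 1% false positives.
choice = pick_threshold(ECOLI_ROWS, 1.0)
print(choice)
```

With a 1% limit this selects the row with threshold 0.65, which is also the row with the highest correlation coefficient in the table.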
The performance per base position was tested on the pBR322 vector:
+-----------+-------------+-------------+-------------+
| threshold | % promoters | % false     | correlation |
|           | recognized  | positives   | coefficient |
|           |             |             | (CC)        |
+-----------+-------------+-------------+-------------+
| 0.96      | 30%         | 0.03%       | 0.38        |
| 0.92      | 50%         | 0.11%       | 0.48        |
| 0.89      | 80%         | 0.16%       | 0.51        |
+-----------+-------------+-------------+-------------+
Further References and Abstract
Another promoter finder on the Web
There is an additional program, SIGNALSCAN, developed by Dr. Dan Prestridge, which can be used to search for transcription factor binding sites in promoter regions. The program can be accessed at two different WWW sites: SIGNALSCAN at NIH or SIGNALSCAN in Singapore.