Representative Benchmark Data Sets of Human DNA Sequences
Please bookmark this site and NOT the sites of the data it is pointing to!
This site will serve as a stable interface to access the datasets, but the location of the sets will be subject to change.
The race to sequence the whole human genome is entering its final stage. In order to build analytical methods for the detection and characterization of human genes and their regulatory regions, accurate datasets are necessary.
Thus, we aim to provide common data sets to be shared among research groups as a stable basis for evaluating and comparing different methods for the analysis of human DNA sequences. Generating data sets of confirmed genes is very time-consuming. We therefore make our ready-to-use training and test sets available and encourage researchers in the community to use these common data sets when developing their methods. Common data sets also allow a fair and rigorous scientific comparison between different methods.
We provide these data sets "as is" in an effort to create a common data set to be used by all algorithms which are aimed towards gene finding and the identification of regulatory sequences (promoters and splice sites).
Our goal when generating these data sets was to make them relatively "clean" and to ensure that each sequence conforms to specific criteria, which are listed in the documentation files that accompany the data sets. To this end, we use restrictive filters, which are run on the databases at irregular intervals to create larger sets as more data become available. Each set is divided into a number of disjoint parts, which can be used for a cross-validated evaluation.
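The cross-validated evaluation mentioned above can be sketched as follows. This is only a minimal illustration, not the procedure used for GENIE: the function name `cross_validation_splits`, the part names, and the toy sequences are all hypothetical, and real parts would be read from the downloaded files as described in their documentation. Each disjoint part is held out once as the test set while the remaining parts form the training set.

```python
from typing import Dict, Iterator, List, Tuple


def cross_validation_splits(
    parts: Dict[str, List[str]],
) -> Iterator[Tuple[str, List[str], List[str]]]:
    """Yield (held_out_name, training_sequences, test_sequences) per part.

    `parts` maps a (hypothetical) part name to the sequences in that part.
    The parts are assumed to be disjoint, as stated for the data sets.
    """
    names = sorted(parts)
    for held_out in names:
        # The held-out part is the test set for this round.
        test = list(parts[held_out])
        # All other parts are pooled into the training set.
        train = [seq for name in names if name != held_out for seq in parts[name]]
        yield held_out, train, test


# Toy stand-in data (assumption); not taken from the actual data sets.
example_parts = {
    "part1": ["ATGGCC", "ATGTTT"],
    "part2": ["ATGAAA"],
    "part3": ["ATGCCC", "ATGGGG"],
}

for name, train, test in cross_validation_splits(example_parts):
    print(name, len(train), len(test))
```

A method would be trained on `train` and scored on `test` in each round, and the per-round scores averaged to give the cross-validated result.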
These data sets were generated in a collaboration between the Informatics Group of the Berkeley Drosophila Genome Project at the Lawrence Berkeley National Laboratory (LBNL), the Computational Biology Group at UC Santa Cruz, the Mathematics Department at Stanford, and the Chair for Pattern Recognition at the University of Erlangen, Germany.
Currently, we offer the following data sets:
- GENIE gene finding data set, containing a total of 793 unrelated human genes. This data set was used to train the GENIE gene finding system (WWW-access) developed at LBNL and UC Santa Cruz. The last update was done in March 1998 using GenBank v.105. Please see the documentation file for further information, including links to previous versions of this data collection.
- The collection of data for human splice sites used in the GENIE system. The splice sites were extracted from the 1996 GENIE data set using GenBank v.95. The set also contains negative samples. Further information
- The collection of data of human and additional eukaryotic promoter regions. The promoters were extracted from the Eukaryotic Promoter Database rel. 50; the negative set contains coding and noncoding sequences from the 1998 GENIE data set. Further information
A similar collection is also available for D. melanogaster DNA sequences. See also the website for C. elegans gene finding resources.
In addition to these data sets, we also used data collected by other authors to evaluate the performance of our programs. These sets comprise:
- The collection of coding and non-coding sequences used in the survey of Fickett & Tung (NAR vol. 20, no. 24, p. 6441-6450, 1992).
- The set of contiguous DNA sequences used in the promoter prediction tool evaluation of Fickett & Hatzigeorgiou (Genome Res. vol. 7, p. 861-878, 1997).
- The set of genes used by Burset & Guigo (Genomics vol. 34, no. 3, p. 353-367, 1996) for the comparison of different gene finders.
We would very much appreciate comments on the appropriateness of the data sets and on the results obtained with them. We also encourage anyone to create similar data sets for other DNA pattern recognition tasks.
Martin Reese (LBNL) martinr@bdgp.lbl.gov
Uwe Ohler (University of Erlangen) ohler@informatik.uni-erlangen.de