Projects

Representative Benchmark Data Sets of Human DNA Sequences

Please bookmark this site and NOT the sites of the data it is pointing to!

This site will serve as a stable interface for accessing the data sets; the location of the sets themselves is subject to change.

The race to sequence the whole human genome is entering its final stage. In order to build analytical methods for the detection and characterization of human genes and their regulatory regions, accurate datasets are necessary.

Thus, we aim to provide common data sets to be shared among research groups as a stable basis for evaluating and comparing different methods for the analysis of human DNA sequences. Generating data sets of confirmed genes is very time-consuming. We therefore make our ready-to-use training and test sets available and encourage researchers in the community to use these common data sets when developing their methods. Common data sets also allow a fair and rigorous scientific comparison between different methods.

We provide these data sets "as is" in an effort to create a common data set for all algorithms aimed at gene finding and the identification of regulatory sequences (promoters and splice sites).

Our goal when generating these data sets was to make them relatively "clean" and to ensure that each sequence conforms to specific criteria, which are listed in the documentation files that accompany the data sets. We therefore used restrictive filters, which are run at irregular intervals on the databases to create larger sets as more data become available. Each set is divided into a number of disjoint parts, which can be used for a cross-validated evaluation.
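As a rough illustration of how disjoint parts support a cross-validated evaluation, the following Python sketch holds out one part per round and trains on the rest. The function name `cross_validation_splits` and the toy sequence identifiers are hypothetical. They are not taken from the data sets or their documentation, and the actual number of parts and the file layout are described there.

```python
from typing import Iterator, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def cross_validation_splits(
    parts: Sequence[Sequence[T]],
) -> Iterator[Tuple[List[T], List[T]]]:
    """Yield (training, test) pairs, holding out one disjoint part per round."""
    for held_out in range(len(parts)):
        # The held-out part is the test set for this round.
        test = list(parts[held_out])
        # All remaining parts are pooled into the training set.
        train = [
            item
            for index, part in enumerate(parts)
            if index != held_out
            for item in part
        ]
        yield train, test


# Hypothetical toy data: three disjoint parts of sequence identifiers.
parts = [["seq_a", "seq_b"], ["seq_c", "seq_d"], ["seq_e"]]
splits = list(cross_validation_splits(parts))

for round_number, (train, test) in enumerate(splits, start=1):
    print(f"round {round_number}: {len(train)} training, {len(test)} test")
```

Because the parts are disjoint, no sequence appears in both the training and the test set of the same round, and every sequence is used for testing exactly once across all rounds.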

These data sets were generated in a collaboration between the Informatics Group of the Berkeley Drosophila Genome Project at the Lawrence Berkeley National Laboratory (LBNL), the Computational Biology Group at UC Santa Cruz, the Mathematics Department at Stanford, and the Chair for Pattern Recognition at the University of Erlangen, Germany.


Currently, we offer the following data sets:

A similar collection is also available for D. melanogaster DNA sequences. Also, check out the Website for C. elegans gene finding resources.

In addition to these data sets, we also used data collected by other authors to evaluate the performance of our programs. These sets comprise

We would very much appreciate comments on the appropriateness of the data sets and on the results obtained with them. We also encourage anyone to create similar data sets for other DNA pattern recognition tasks.


Martin Reese (LBNL) martinr@bdgp.lbl.gov
Uwe Ohler (University of Erlangen) ohler@informatik.uni-erlangen.de