Adh datasets for assessing the Genome Annotation Experiment results
One of the major challenges in evaluating the effectiveness of
sequence annotation systems is the lack of powerful reference data
sets.
We developed two data sets to help us evaluate participants'
submissions:
- Std1 (original ISMB file) is based on high-quality cDNA<->genomic sequence
alignments. Starting with a set of 80 full length cDNA sequences
from the Adh region, we ended up with 43 annotated transcripts
(start_codon, stop_codon, exon, CDS, splice3, and splice5
features) with strong alignments to the genomic sequence and
whose splice sites matched a simple "GT/AG" consensus and scored
well using a neural net splice site predictor.
We hope that this data set, with its narrow and stringent
criteria, can be used as an effective estimate of a set of "known
to be correct" annotations.
- Std1_corrected is a further corrected version (1/31/00) of the
original ISMB std1. Five suspicious cDNA alignments were removed,
leaving a total of 38 transcripts.
- Std3 is based on the BDGP's annotations of the Adh region, as
described in Ashburner et al. These annotations combine
computational and biological research results under the
supervision of experienced Drosophila biologists. With 222
transcript annotations, this set is much more extensive than
std1. Approximately 182 of the annotations are similar to a
known protein sequence or a Drosophila EST, while approximately 40
are based on computational results.
While there is less experimental evidence for std3's annotations,
we hope that with its broad coverage and careful curation it can
be used as an effective estimate of the full set of genes that
exist in this region.
- Additionally, we compiled a list of 92 5' UTR start sites from
std3. All of them were confirmed by full-length cDNA alignment and
contain a complete open reading frame downstream. This set of UTRs
was used to evaluate the promoter predictions.
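
The "GT/AG" consensus criterion mentioned for std1 can be illustrated
with a short sketch. This is not the code used to build std1; it is a
minimal example that assumes forward-strand transcripts, 1-based
inclusive exon coordinates, and hypothetical function names
(intron_has_gt_ag, transcript_passes). It covers only the consensus
check, not the alignment scoring or the neural net splice site
predictor.

```python
def intron_has_gt_ag(genomic, intron_start, intron_end):
    """Return True if the intron starts with GT and ends with AG.

    Coordinates are 1-based and inclusive (an assumption of this sketch).
    """
    intron = genomic[intron_start - 1:intron_end].upper()
    # An intron shorter than 4 bases cannot hold both dinucleotides.
    return len(intron) >= 4 and intron.startswith("GT") and intron.endswith("AG")


def transcript_passes(genomic, exons):
    """Check every intron implied by a list of (start, end) exon spans.

    Exons are assumed to lie on the forward strand and not to overlap.
    """
    ordered = sorted(exons)
    for (_, left_end), (right_start, _) in zip(ordered, ordered[1:]):
        # The intron is the gap between two consecutive exons.
        if not intron_has_gt_ag(genomic, left_end + 1, right_start - 1):
            return False
    return True


if __name__ == "__main__":
    # Toy sequence: exon AAA, intron GTCCCAG, exon TTT.
    toy = "AAAGTCCCAGTTT"
    print(transcript_passes(toy, [(1, 3), (11, 13)]))
```

A transcript with a single exon has no introns and therefore passes
trivially under this check; reverse-strand transcripts would need the
sequence reverse-complemented first.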
compfly@bdgp.lbl.gov