Release 2.5 Drosophila Genomic Sequence

(Last assembled Mon Jun 24 02:09:13 PDT 2002)


X1-11	X12-20	2L	2R	3L	3R	4	Main

The Release 2 (October 2000) sequence of the Drosophila genome produced by the BDGP and Celera Genomics is a whole genome shotgun sequence assembly with ~1300 gaps and a limited number of regions of low sequence quality. We are in the process of producing Release 3. Our aims for Release 3 are:

To close all gaps in the euchromatic portion of the genome
To raise the quality of the sequence uniformly to an estimated error rate of less than one in 100,000 base pairs in the unique portion and less than one in 10,000 in the repetitive portion
To confirm the accuracy of the assembly by comparison to restriction digests of BAC clones
To replace the composite transposon sequences with true sequence of each transposon
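As a rough illustration of what the quality target above implies, the following sketch computes the upper bound on expected errors for a region, given how many of its bases are unique and how many are repetitive. The function name and the example region sizes are hypothetical and chosen only for the arithmetic; the two rates are the ones stated in the aims.

```python
from fractions import Fraction

# Target error rates stated in the Release 3 aims.
UNIQUE_RATE = Fraction(1, 100_000)   # less than 1 error per 100,000 bp (unique sequence)
REPEAT_RATE = Fraction(1, 10_000)    # less than 1 error per 10,000 bp (repetitive sequence)


def max_expected_errors(unique_bp, repetitive_bp):
    """Upper bound on the expected number of errors under the target rates.

    Both arguments are base-pair counts. How a region is split into unique
    and repetitive sequence is an input here, not something this sketch derives.
    """
    return unique_bp * UNIQUE_RATE + repetitive_bp * REPEAT_RATE


# Hypothetical 1 Mb region: 900 kb unique, 100 kb repetitive.
bound = max_expected_errors(900_000, 100_000)
print(bound)
```

Exact fractions are used instead of floats so that the bound is not affected by rounding.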

For Release 3, Celera has provided the BDGP with the primary sequence trace data from the whole genome shotgun project, with the traces sorted in the order in which they occur in the Release 2 sequence assembly. We are producing Release 3 by a six-step process. Chromosome arms 2L, 2R, 3R, 4 and numbered divisions 12-20 of the X chromosome are being finished at Lawrence Berkeley National Laboratory, and questions on this portion of the genome sequence may be directed to bdgp@fruitfly.org. Chromosome arm 3L and numbered divisions 1-11 of the X chromosome are being finished at the Human Genome Sequencing Center at Baylor College of Medicine, and questions on this portion of the genome sequence may be directed to David Wheeler (wheeler@bcm.tmc.edu) and Steve Scherer (scherer@bcm.tmc.edu). The finishing strategy is as follows:

1. Using BAC end sequences, we sort the Celera sequence traces into BAC-sized bins.
2. We assemble the sorted whole genome shotgun sequence traces with our own draft sequence traces in BAC-sized pieces using the Phrap assembler (P. Green, University of Washington, Seattle).
3. We use the protocols that are in place at both sites for finishing each BAC to at least the "Phase 3" standard (fewer than 1 error per 10,000 bases; all gaps closed if possible with current technology); the details differ somewhat between LBNL and BCM, but both sites have extensive experience in finishing.
4. We compare the predicted restriction digests of each sequence assembly with three restriction digests (EcoRI, HindIII and BamHI) of each BAC to confirm the accuracy of each assembly.
5. We compare the new assemblies to the Release 2 whole genome assembly and resolve any discrepancies using the BAC restriction digests or other methods (e.g., PCR from uncloned genomic DNA) to determine the correct assembly.
6. The BAC-sized sequences are concatenated into large contigs. The contigs will be subdivided into ~350 kb sections to match the original Release 1 GenBank records. The newly annotated sequences will be submitted as updates to the GenBank records and will retain the original scaffold ID and accession numbers wherever possible (in some cases one record may be divided into two new GenBank records).
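The restriction digest comparison in the strategy above can be sketched in a few lines. This is a minimal illustration, not the pipeline used at LBNL or BCM: it treats the assembled sequence as a linear molecule, ignores the BAC vector, and compares fragment sizes with a simple relative tolerance. The function names and the 5% tolerance are assumptions made for the example; the recognition sites and cut positions (G^AATTC, A^AGCTT, G^GATCC) are the standard ones for these three enzymes.

```python
import re

# Recognition site and cut offset on the top strand for each enzyme.
# All three sites are palindromic, so scanning one strand is sufficient.
ENZYMES = {
    "EcoRI": ("GAATTC", 1),    # G^AATTC
    "HindIII": ("AAGCTT", 1),  # A^AGCTT
    "BamHI": ("GGATCC", 1),    # G^GATCC
}


def predicted_fragments(seq, enzyme):
    """Sorted fragment lengths from a complete digest of a linear sequence."""
    site, offset = ENZYMES[enzyme]
    seq = seq.upper()
    # The lookahead makes the scan report every site start position.
    cuts = [m.start() + offset for m in re.finditer(f"(?={site})", seq)]
    bounds = [0] + cuts + [len(seq)]
    return sorted(end - start for start, end in zip(bounds, bounds[1:]))


def digests_agree(predicted, observed, tolerance=0.05):
    """True if the size-sorted fragment lists pair up within a relative tolerance."""
    if len(predicted) != len(observed):
        return False
    pairs = zip(sorted(predicted), sorted(observed))
    return all(abs(p - o) <= tolerance * p for p, o in pairs)
```

In practice, observed band sizes from a gel carry measurement error and very small fragments may not be resolved, so a real comparison would need a more forgiving matching rule than the one-to-one pairing used here.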

Once sequences have passed through steps 1-5, we submit the BAC-based sequences to GenBank, and we make the BAC-based and concatenated large sequence contigs publicly available through this web site. This Release 2.5 sequence is of high quality and contains no gaps.

Note on accession numbers: BACs that are verified and submitted to NCBI have their GenBank accession numbers displayed next to the BAC name on the assembly page. BACs that are part of finished regions but do not yet have accession numbers are noted with an asterisk. These are generally finished to Phase 3 standards but have not completed all of the quality control checks. Some of these BACs may have old records available at GenBank, but the sequence may differ from the working sequence available here.

Click on any of the chromosome arm names to see the clones and assembled segments for that arm.



Assembly Statistics

Arm	Bases in assembled segments
X1-11	11405022
X12-20	8788192
2L	22205349
2R	20300755
3L	23088932
3R	27902919
4	1236870


# segments	37
Total bases in segments	114928039
Average segment size	3106163
Largest segment size	27902919
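The summary figures above can be cross-checked against the per-arm totals with a short calculation. The numbers below are copied from the table; the variable names are illustrative, and the average is assumed to be rounded down to a whole number of bases.

```python
# Bases in assembled segments per arm, as listed in the table.
arm_bases = {
    "X1-11": 11405022,
    "X12-20": 8788192,
    "2L": 22205349,
    "2R": 20300755,
    "3L": 23088932,
    "3R": 27902919,
    "4": 1236870,
}
n_segments = 37  # number of segments, from the table

total_bases = sum(arm_bases.values())
average_segment = total_bases // n_segments     # integer division rounds down
largest_arm = max(arm_bases, key=arm_bases.get)  # arm with the most assembled bases

print(total_bases, average_segment, largest_arm)
```

The sum of the per-arm values reproduces the reported total, and the largest segment size in the table equals the assembled total for arm 3R.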