Table of ContentsThe challenge of annotating a complete eukaryotic genome:A case study in Drosophila melanogaster Abstract Tutorial goals Tutorial organization What is a gene? What are annotations? How does an annotation differ from a gene? Transcription and translation Schematic gene structure Sequence feature types DNA transcription unit features mRNA features PPT Slide Definitions for data modeling Annotation Annotation process overview Types of sequence data Auxiliary data Computational annotation tools Database resources Biological issues in annotation Engineering issues in annotation Engineering issues in annotation Engineering issues in annotation Engineering issues in annotation Engineering issues in annotation Drosophila melanogaster Drosophila Genome Project Goals of the Drosophila Genome Project Sequencing at the BDGP The BDGP sequence annotation process What sequence to start with? Which analyses need to be run? Which analyses need to be run and how? What public sequence data sets are needed? Which analyses need to be run and how? How do you achieve computational throughput? What do you do with the results? Is human curation needed? Gene Skimmer Gene Skimmer CloneCurator PPT Slide How do we annotate gene/protein function? Ontology browser PPT Slide Ontology browser: searching for terms How do you distribute the data? Ribbon Ribbon How do you manage the data? How do you maintain annotations? Integrated annotation systems Integrated annotation systems: ACeDB ACeDB Genotator Magpie GAIA TIGR Human Gene Index Computational analysis tools Gene finding: Prokaryotes vs. Eukaryotes Gene finding: Prokaryotes vs. 
Eukaryotes Integrated gene finding Integrated gene finding: Dynamic programming Integrated gene finding: Dynamic programming Integrated gene finding: Linear and Quadratic Discriminant Analysis (LDA/QDA) Integrated gene finding: Feed-forward neural networks Approaches to gene finding: Hidden Markov models Approaches to gene finding: Generalized hidden Markov models Gene finding software Promoter recognition Promoter recognition (cont.) Promoter recognition (cont.) Promoter recognition (cont.) Example: NNPP Promoter recognition (cont.) Splice site prediction Splice site prediction (cont.) Splice site prediction (cont.) Start codon prediction Poly-adenylation signal prediction Prediction of coding potential Prediction of coding potential (cont.) Prediction of coding potential (cont.) Prediction of coding potential (cont.) Prediction of coding potential (cont.) Prediction of coding exons “Integrated” gene models: LDA/QDA “Integrated” gene models: NN “Integrated” gene models: Artificial intelligence approaches “Integrated” gene models: Artificial intelligence approaches “Integrated” gene models: HMMs “Integrated” gene models: GHMMs Example: Genie “Integrated” gene models: GHMMs EST/cDNA alignment for gene finding: Spliced alignments EST/cDNA alignment EST/cDNA alignment (cont.) Repeat finders Repeat finders (cont.) Homology searching Gene family searching The genome annotation experiment (GASP1) PPT Slide Goals of the experiment Adh contig Adh paper (to appear in Genetics) Raw sequence: Adh.fa Drosophila data sets provided to participants Timetable Resources for assessing predictions Curated data sets for assessing predictions Curated data sets for assessing predictions Curated data sets for assessment Submission format Sample submission Submissions Submissions (cont.) Submissions (cont.) Submissions (cont.) Submissions (cont.) Submissions (cont.) Submissions (cont.) Submissions (cont.) Submissions (cont.) Submissions (cont.) Submissions (cont.) Submissions (cont.) 
Submissions (cont.) Submissions (cont.) Submissions (cont.) Submissions (cont.) Submissions (cont.) Submissions (cont.) Submission classes Submission classes (cont.) Gene finding techniques Measuring success Definitions and formulae Genes: True positives (TP) Genes: False positives (FP) Genes: False Negatives (FN) Toy example 1 (1) Genes: Missing Genes (MG) Genes: Wrong Genes (WG) Toy example 1 (2) Genes: Std 1 versus Std 3 Toy example 1 (3) Genes: Std1 and Std3 versus “real” gene structure Toy example 1 (4) Toy example 1 (5): Exon level Genes: Joined genes (JG) Genes: Split genes (SG) Definition: “Joined” and “split” genes Toy example 2 (1) Annotation experiment results Results: Base level Results: Exon level Results: Gene level Results: Gene level Results (protein homology): Base level Results (protein homology): Exon level Results (protein homology): Gene level Transcription Start Site (TSS): Standard 1 TSS: Standard 3 Results: TSS recognition Interesting gene examples: bubblegum Adh/Adhr (Alcohol dehydrogenase/Adh related) Adh/Adhr (cont..) osp (outspread) cact (cactus) kuz (kuzbanian) beat (beaten path) Idfg1, Idfg2, Idfg3 (Imaginal Disc Growth Factor) Idfg1, Idfg2, Idfg3 (cont.) Conclusion of GASP1 Conclusion GASP1 (cont.) Discussion GASP1 Conclusions on annotating complete eukaryotic genomes Conclusions on annotating complete eukaryotic genomes (cont.) Discussion on annotating complete eukaryotic genomes Acknowledgments |

Authors: Martin G. Reese, Nomi L. Harris, George Hartzell, Suzanna E. Lewis
Email: mgreese@lbl.gov
Home Page: http://www.fruitfly.org/GASP1