diArk - a resource for eukaryotic genome research
Since the publication of the first complete genome sequence of a eukaryote, Saccharomyces cerevisiae [1], the genome sequencing community has produced highly advanced drafts of many other eukaryotic genomes. The past few years have thus seen the rise of a new field in biology, comparative genomics [2]. Initial results have shown that whole-genome comparisons are important for improving the annotation of genes and transcripts. It has also been demonstrated that genome sequences are needed not only from organisms spread over all kingdoms of eukaryotic life but also from many closely related organisms [3].
We have developed diArk (digital ark), which provides information on eukaryotic sequencing projects that have resulted in either at least preliminary assemblies of genome data or a substantial amount of EST or cDNA data [4]. At the center of the database is extensive species-related information (commonly and alternatively used scientific names, common names, and complete taxonomies), together with detailed information about the respective sequencing projects. Apart from keeping the data up to date, our focus has been on a feature-rich user interface with comprehensive and easy-to-use search capabilities.
| [1] | A Goffeau, BG Barrell, H Bussey, RW Davis, B Dujon, H Feldmann, F Galibert, JD Hoheisel, C Jacq, M Johnston et al: Life with 6000 genes. Science 1996, 274:563-7. |
| [2] | TT Binnewies, Y Motro, PF Hallin, O Lund, D Dunn, T La, DJ Hampson, M Bellgard, TM Wassenaar, DW Ussery: Ten years of bacterial genome sequencing: comparative-genomics-based discoveries. Funct Integr Genomics 2006, 6:165-185. |
| [3] | JE Galagan, MR Henn, LJ Ma, CA Cuomo, B Birren: Genomics of the fungal kingdom: insights into eukaryotic biology. Genome Res 2005, 15:1620-31. |
| [4] | F Odronitz, M Hellkamp, M Kollmar: diArk – a resource for eukaryotic genome research. BMC Genomics 2007, 8:103. |
CyMoBase - a database for cytoskeletal and motor proteins
Annotation of protein sequences of eukaryotic organisms is crucial for understanding their function in the cell. Manual annotation is still by far the most accurate way to predict genes. Classifying protein sequences, determining their phylogenetic relations, and assigning function all draw on information from various sources. This often leads to a collection of heterogeneous data that is hard to track. Cytoskeletal and motor proteins form large and diverse superfamilies comprising up to several dozen members per organism. Because genome sequence data is accumulating rapidly, it is very important to have a reference database for the nomenclature and phylogenetic relations of these proteins that allows the most accurate possible assignment of biological function. CyMoBase is a protein sequence-centric web application to store, organize, interrelate, and present the heterogeneous data generated during manual genome annotation and comparative genomics [1]. It offers analysis tools such as extensive statistics and a BLAST service.
| [1] | F Odronitz, M Kollmar: Pfarao: A web application for protein family analysis customized for cytoskeletal and motor proteins (CyMoBase). BMC Genomics 2006, 7:300. |
Scipio - eukaryotic gene identification
In the post-genome era, sequence data is the entry point for many studies. It is often highly relevant to obtain the correct genomic DNA sequences of eukaryotic genes because of the important information contained in non-coding regions. For example, introns contain important sites for the regulation of gene transcription, such as enhancers, repressors, and silencers [1]. Determining the exon/intron structures of genes is also important in comparative genomic analyses, such as the identification of ancient exons [2].
Currently, two programs are available for the retrieval of non-coding sequences. The Java application Retrieval of Regulative Regions (RRE) parses annotation and homology data from NCBI [3]. RRE requires local installation and a local copy of the desired genomes and annotation files. The web application of RRE hosts only a small number of eukaryotic genomes and only annotation data from NCBI. The recently published non-coding sequences retrieval system (NCSRS) [4] covers 16 genomes with annotation data from both NCBI and Ensembl. In summary, both tools rely on parsing annotation files provided by NCBI or Ensembl and cover only a few organisms.
We have developed Scipio for the retrieval of the genome sequence corresponding to a protein query. The tool does not require any annotation data and is able to correctly identify the gene even if it is spread across several genome contigs and contains mismatches and frameshifts. Because of its post-processing capabilities, Scipio can correctly identify not only the gene corresponding to the protein query but also the homologous genes in the genomes of closely related organisms.
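To make the idea of a gene spread across several contigs more concrete, the following Python sketch shows one possible way to order protein-to-genome alignment hits along the query and derive intron coordinates from them. It is purely illustrative and does not reproduce Scipio's actual implementation: the `Hit` record, the functions `assemble_gene` and `uncovered_residues`, the contig names, and all coordinates are hypothetical, and the sketch assumes forward-strand hits only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One aligned stretch of the protein query on a contig (hypothetical record)."""
    contig: str
    query_start: int  # first aligned protein residue, 1-based inclusive
    query_end: int    # last aligned protein residue, inclusive
    dna_start: int    # first aligned nucleotide on the contig, 1-based inclusive
    dna_end: int      # last aligned nucleotide on the contig, inclusive


def assemble_gene(hits):
    """Order hits along the protein; report introns between neighbours on one contig."""
    exons = sorted(hits, key=lambda h: h.query_start)
    introns = []
    for left, right in zip(exons, exons[1:]):
        # An intron can only be delimited when both exons lie on the same contig.
        if left.contig == right.contig and right.dna_start > left.dna_end + 1:
            introns.append((left.contig, left.dna_end + 1, right.dna_start - 1))
    return exons, introns


def uncovered_residues(exons, protein_length):
    """Return the protein ranges that no hit covers (exons sorted by query_start)."""
    gaps, next_expected = [], 1
    for exon in exons:
        if exon.query_start > next_expected:
            gaps.append((next_expected, exon.query_start - 1))
        next_expected = max(next_expected, exon.query_end + 1)
    if next_expected <= protein_length:
        gaps.append((next_expected, protein_length))
    return gaps


# Made-up example: a 200-residue protein whose gene lies on two contigs.
example_hits = [
    Hit("contig_7", 101, 180, 500, 739),
    Hit("contig_2", 1, 50, 1000, 1149),
    Hit("contig_2", 51, 100, 1400, 1549),
]
example_exons, example_introns = assemble_gene(example_hits)
example_gaps = uncovered_residues(example_exons, 200)
```

In this made-up example the first two exons share a contig, so the sequence between them is reported as an intron, whereas the jump to the second contig yields no intron coordinates. Residues of the query that remain uncovered are the kind of gap that a post-processing step would then have to resolve.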
| [1] | L Fedorova, A Fedorov: Introns in gene evolution. Genetica 2003, 118:123-31. |
| [2] | M Irimia, JL Rukov, D Penny, SW Roy: Functional and evolutionary analysis of alternatively spliced genes is consistent with an early eukaryotic origin of alternative splicing. BMC Evol Biol 2007, 7:188. |
| [3] | F Lazzarato, G Franceschinis, M Botta, F Cordero, RA Calogero: RRE: a tool for the extraction of non-coding regions surrounding annotated genes from genomic datasets. Bioinformatics 2004, 20:2848-2850. |
| [4] | ST Doh, Y Zhang, MH Temple, L Cai: Non-coding sequence retrieval system for comparative genomic analysis of gene regulatory elements. BMC Bioinformatics 2007, 8:94. |







