RESEARCHERS
FARBER Marisa Diana
Conferences and scientific meetings
Title:
Another Tool for Genomic Comprehension (ATGC): an ontology driven database and web interface applied to Sunflower Microarray Project
Author(s):
BERNARDO J. CLAVIJO; PAULA FERNANDEZ; GONZALEZ SERGIO; RIVAROLA M; HEINZ R; FARBER M; NORMA PANIEGO
Place:
Santiago de Chile
Meeting:
Conference; ISCB Latin America; 2012
Organizing institution:
International Society for Computational Biology
Abstract:
Although microarray technology started a new era of high-throughput transcriptomic analysis approximately ten years ago, starting with 8,000 printed genes by Affymetrix in Arabidopsis thaliana and later scaling up to 45,000 printed genes in rice and 90,000 in Brassica, next-generation sequencing (NGS) technologies are now opening an era of even deeper understanding of genomics and transcriptomics in different species. However, for the foreseeable future both technologies will coexist, each focusing on different tasks, either by contributing complementary biological information or by supporting functional studies of a specific pathway or developmental stage through dedicated oligonucleotide arrays. One obvious application of microarray technology is transcriptional profiling in species that have neither their own genome sequenced nor a reference genome from a closely related species. For some of these species, commercial microarrays based on an existing or custom design are available (Agilent, Affymetrix, Nimblegen, etc.). Sunflower fits into this framework: although a genome sequencing initiative is in progress, no reference genome is available. In this case, functional information is limited to EST databases, which for cultivated sunflower are rather extensive: more than 133,000 ESTs are publicly available (http://ncbi.nlm.nih.gov/dbEST/dbEST_summary.html), covering libraries prepared from several lines and cultivars, together with a recently published assembly of ca. 6 Gb of next-generation sequence data produced for SNP discovery.
However, it should also be noted that EST libraries tend to be significantly contaminated with vector sequences and chimeras, and contain relatively low-quality sequence because the library sequencing strategy prioritizes obtaining a large number of single-pass reads. A standardized set of bioinformatics routines is therefore needed to clean and decontaminate public raw sequences.

Currently, the shortage of candidate genes underlying agronomically important traits represents one of the main drawbacks in sunflower molecular breeding. In this context, functional tools that allow concerted transcriptional studies, such as high-density oligonucleotide microarrays, strongly support the discovery and characterization of novel genes. Oligonucleotide-based chips not only allow the analysis of a whole transcriptome but are also considered more accurate than cDNA-based chips because they involve fewer manipulation steps. The possibility of implementing this technology on any custom array system, such as Agilent, Nimblegen and others, has the potential to create a very useful tool for gene discovery in non-model crops. In addition, the longer probe format is a major advantage of Agilent oligonucleotide microarrays over other technologies: longer probes are more stable in the presence of sequence mismatches and are consequently more suitable for the analysis of highly polymorphic regions.

In this study, public and proprietary H. annuus L. EST datasets were used to create a comprehensive sunflower unigene collection. These datasets comprise 34 cDNA libraries from different cultivars and various tissues and anatomical parts, from plants grown under different physiological conditions. We present the development of a comprehensive Sunflower Unigene Resource of H. annuus L. (SUR v 1.0), its functional annotation, and the design and validation of a custom sunflower oligonucleotide-based microarray for the identification of concerted transcriptional responses associated with biotic and abiotic responses. This development is an initiative of the Sunflower Argentinean Consortium, working in collaboration with the Institute Principe Felipe, Valencia, Spain, within the framework of a public research project. To design and customize this microarray, 133,682 public ESTs were clustered and assembled into 12,924 contigs and 28,089 singletons using CAP3, with parameters set according to the most relevant recently published microarray designs (p=95, f=45, h=25, o=80).

After cleaning and removal of low-quality and short (<100 bp) sequences, the dataset was reduced to 132,479 reads. Additional processed ESTs and gene sequences of special interest for relevant traits were also added to the initial dataset. The final assembly resulted in 41,013 putative transcripts. This analysis showed no bias among ESTs originating from the sunflower cDNA libraries deposited in GenBank, which strongly supports the microarray's design and its potential functional coverage. Finally, a set of 678 consensus contigs (or super-contigs) was generated from unigenes that showed high BLAST sequence homology but did not cluster together in the CAP3 assembly. These super-contigs could address potential variants stemming from sequencing errors, gene duplication or allelic variation. They were included in the microarray design, which resulted in 40,169 probes. GO terms were mapped by running Blast2GO against a local GO database (2011-08 update). Annotation was completed by running a local installation of InterProScan v4.7 followed by InterPro2GO (database version 31.0, release February 2011).
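As an illustration of the cleaning and assembly steps reported earlier in this abstract, the following is a minimal sketch, not the authors' pipeline. It assumes plain FASTA input, shows only the length filter (vector and quality cleaning are not covered), and builds a CAP3 command line with the quoted parameters without running it. The function names, the file name and the assumption that a `cap3` executable is on the PATH are hypothetical.

```python
from typing import Dict, List

MIN_LENGTH = 100  # sequences shorter than 100 bp are discarded, as in the abstract


def read_fasta(text: str) -> Dict[str, str]:
    """Parse FASTA-formatted text into a {identifier: sequence} dictionary."""
    records: Dict[str, str] = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            name = fields[0] if fields else ""
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records


def drop_short(records: Dict[str, str], min_length: int = MIN_LENGTH) -> Dict[str, str]:
    """Keep only sequences with at least min_length bases."""
    return {name: seq for name, seq in records.items() if len(seq) >= min_length}


def cap3_command(fasta_path: str) -> List[str]:
    """Build (but do not run) a CAP3 call with the parameters quoted in the abstract.

    -p: overlap percent identity cutoff, -f: maximum gap length in an overlap,
    -h: maximum overhang percent length, -o: overlap length cutoff.
    """
    return ["cap3", fasta_path, "-p", "95", "-f", "45", "-h", "25", "-o", "80"]


if __name__ == "__main__":
    # Toy input: one 120 bp read and one 40 bp read (hypothetical identifiers).
    toy = ">est_long\n" + "ACGT" * 30 + "\n>est_short\n" + "ACGT" * 10 + "\n"
    kept = drop_short(read_fasta(toy))
    print(sorted(kept))
    print(" ".join(cap3_command("clean_ests.fasta")))
```

The command is returned as a list so that it could be passed to `subprocess.run` unchanged once CAP3 is installed.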
For the InterProScan annotation, sequences with BLASTX hits were considered whole, in the same reading frame as the hit, and anonymous sequences were considered as 6-frame translations.

In this work, we present ATGC (Another Tool for Genomic Comprehension), a database to store, visualize, analyze and share this information, including the probes associated with each unigene represented in the microarray. The database is available at http://bioinformatica.inta.gov.ar/ATGC/, currently with access restricted by user name and password. ATGC is based on Chado (Generic Model Organism Database, http://gmod.org), an ontology-driven relational database schema implemented in PostgreSQL, and a web interface built with web2py. One of the main goals of ATGC is to facilitate the exploration and visualization of the data. The main development effort went into exploiting the GO annotation and analyzing the annotated genes, allowing users to move through the GO-DAG structure and to navigate between the classes of genes available in different projects. Strong emphasis has been placed on showing each gene only once in each GO category. The GO term Feature Search pages show every feature directly annotated to a term, add every feature indirectly annotated, and state, for every feature displayed, through which annotated term it inherits the searched name or ID. This routine has dramatically expanded the possibilities for interpretation and exploration centered on GO annotation. To facilitate access to information, the Feature Detail Page gathers all information related to a feature, including links to related data. The oligonucleotide microarray probe sequence and a list of potentially cross-hybridizing probes that probably match the same unigene are among these data. We are currently working on minor debugging and on an easy installer for ATGC for different Unix distributions and even the Windows operating system.
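The behaviour described for the GO term Feature Search pages (direct plus indirect annotation, each feature listed once, with the inheriting term reported) can be sketched as follows. This is only an illustration under stated assumptions: it uses in-memory dictionaries rather than the Chado tables behind ATGC, and the term IDs, unigene names and function names are hypothetical.

```python
from collections import deque
from typing import Dict, List, Set, Tuple

# Toy child -> parents map standing in for the GO DAG (hypothetical term IDs).
PARENTS: Dict[str, Set[str]] = {
    "GO:B": {"GO:A"},
    "GO:C": {"GO:A"},
    "GO:D": {"GO:B", "GO:C"},
}

# Toy direct annotations: feature -> GO terms it is annotated to.
DIRECT: Dict[str, Set[str]] = {
    "unigene_1": {"GO:A"},
    "unigene_2": {"GO:D"},
    "unigene_3": {"GO:B", "GO:D"},
}


def descendants(term: str, parents: Dict[str, Set[str]]) -> Set[str]:
    """Return every term below `term` in the DAG (breadth-first traversal)."""
    children: Dict[str, Set[str]] = {}
    for child, term_parents in parents.items():
        for parent in term_parents:
            children.setdefault(parent, set()).add(child)
    seen: Set[str] = set()
    queue = deque([term])
    while queue:
        current = queue.popleft()
        for child in children.get(current, ()):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


def feature_search(
    term: str, parents: Dict[str, Set[str]], direct: Dict[str, Set[str]]
) -> List[Tuple[str, str, List[str]]]:
    """List each matching feature once, as (feature, 'direct' or 'inherited', terms)."""
    below = descendants(term, parents)
    rows: List[Tuple[str, str, List[str]]] = []
    for feature in sorted(direct):
        terms = direct[feature]
        if term in terms:
            rows.append((feature, "direct", [term]))
        else:
            via = sorted(terms & below)  # descendant terms that carry the annotation
            if via:
                rows.append((feature, "inherited", via))
    return rows


if __name__ == "__main__":
    for row in feature_search("GO:A", PARENTS, DIRECT):
        print(row)
```

Iterating over features rather than over terms is what keeps each feature to a single row even when it reaches the searched term through several paths in the DAG.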
The sunflower project is opening up new possibilities: updated ontologies are being tested to add more information, and we plan to integrate a genome browser through DMAP to enable genomic querying (see the DMAP poster at this meeting). Finally, we plan to optimize the collection management features, allowing users to create and manipulate lists of features by different criteria, and to connect the database to complementary platforms for data processing and analysis such as Galaxy and DMAP, providing a means to follow an accurate protocol for data manipulation and storage.