RESEARCHERS
FARBER Marisa Diana
conferences and scientific meetings
Title:
Data Merge Annotation Pipeline (DMAP); Utilizing a sequence coordinate based approach to Prokaryotic annotation
Author(s):
RIVAROLA M; GONZALEZ SERGIO; BERNARDO J. CLAVIJO; P. FERNANDEZ; MARTINEZ MC; CERÓN-CUCHI M; CRAVERO S; DOPAZO J; FARBER M; PANIEGO N
Location:
Santiago de Chile
Meeting:
Conference; ISCB Latin America; 2012
Organizing institution:
International Society for Computational Biology
Abstract:
Over the last few years, as genomics and genetics research has become more available, more affordable and larger in scale, the need to provide accurate and reliable access to the resulting data and to combined analyses of genomes has grown with it. One approach is to combine the outputs of different software tools and merge the results, so that the reliability of the merged output can be checked by visual analysis. More than 1,000 microbial genomes have now been completely sequenced, and high-throughput sequencing (Next-Generation Sequencing, NGS) technologies underscore the importance of computational methods for annotating and mining genomic data. For example, correct gene structure prediction is vital to deciphering the subcellular localization of gene products, because it can reveal a possible signal peptide in the usually “overlooked” 5' region. Correct gene prediction is also one of the key steps toward understanding the biochemistry, physiology and ecology of the organism. A simplified annotation method may over- or underestimate the number of predicted genes and place gene boundaries misleadingly. In summary, no off-the-shelf solution exists for the assembly, gene prediction, genome annotation and merged-data presentation needed to interpret and take full advantage of all genomic features. The large investment in custom bioinformatics support that any genome sequencing project requires remains a major obstacle to fully understanding an organism's genome.
In this regard, we present DMAP, a versatile and customizable approach to the computational analysis of bacterial genomes. As a use case, we sequenced the genome of Butyrivibrio fibrosolvens, a bacterium that lives in the cow rumen, using NGS. The pipeline was implemented to process 454 reads and uses the Chado schema of GMOD (http://gmod.org/wiki/) on a PostgreSQL database to load and analyze the functional annotation results. The Chado database was loaded through the ATGC web user interface (web UI) (see the ATGC poster at this meeting).
DMAP is capable of merging assemblies from different approaches, combining gene predictors and functionally annotating genes and gene products from different sources, while displaying all genomic data to the user in a customized GBrowse (gmod.org/wiki/GBrowse) with each analysis in a separate track. This database is available at http://bioinformatica.inta.gov.ar/fgb2/gbrowse/buty/, with access restricted by user name and password. For the Butyrivibrio fibrosolvens genome project, the pipeline consisted of three stages:
i) Genome assembly: a set of contigs and super-contigs from the de novo assembly using Newbler (roche.com) is compared with the results of an optical mapping approach, OpGen (Argus Optical Mapping System), and scaffolds are ordered while consistency between the two approaches is computed and checked.
ii) Gene prediction: a self-trained Glimmer3 run, with two iterations of false-positive identification based on statistics gathered from the “obvious” sets of genes, in combination with several BLAST (tblastx/blastp/blastx) and MUMmer (nucmer/promer) results run against different sets of metagenomic and genomic data.
iii) Functional annotation: in an effort to create a reliable set of functional annotations, we paid special attention to the versions of the Sequence Ontology (SO), the Gene Ontology (GO), and the Cog2Go (COG: Clusters of Orthologous Groups) and Interpro2GO mapping files, in order to guarantee consistency throughout the process. With this in mind, we followed a strict protocol involving a complete local installation of all programs in InterProScan (www.ebi.ac.uk/Tools/pfa/iprscan/) and Blast2GO (http://www.blast2go.com). This database is available at http://bioinformatica.inta.gov.ar/ATGCbuty/.
Taking all sets of data into account, DMAP is designed to merge the results of different tools and to query the merged data through a coordinate-based system.
The web UI is similar to the querying enabled in the UCSC Genome Browser (http://genome.ucsc.edu/). To allow fast and accurate querying, however, DMAP will load a PostgreSQL database from the provided coordinate-based files and check all files for consistency, so that every feature can be mapped to the reference sequence(s). Overall, the three main goals are: to develop a reliable and trustworthy pipeline for genome annotation, to be a comprehensive tool that combines different types of data, and to provide an easy web UI for data querying. Together, these will allow data integration so that a biologist can just “ask the question”.
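The abstract does not describe how the consistency check or the coordinate-based querying is implemented. The sketch below shows one way the idea could work, and is not DMAP's actual code. It assumes features are held in memory rather than in PostgreSQL and use 1-based inclusive coordinates. The `Feature` class, the function names, and all sample identifiers and coordinates are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    seq_id: str   # reference sequence (e.g. a scaffold) the feature lies on
    start: int    # 1-based, inclusive
    end: int      # 1-based, inclusive
    source: str   # tool that produced the feature
    name: str


def check_consistency(features, seq_lengths):
    """Split features into those that map onto a known reference and those that do not."""
    valid, rejected = [], []
    for f in features:
        length = seq_lengths.get(f.seq_id)
        if length is not None and 1 <= f.start <= f.end <= length:
            valid.append(f)
        else:
            rejected.append(f)
    return valid, rejected


def query_region(features, seq_id, start, end):
    """Return features from any source that overlap seq_id:start-end, sorted by position."""
    hits = [f for f in features
            if f.seq_id == seq_id and f.start <= end and f.end >= start]
    return sorted(hits, key=lambda f: (f.start, f.end, f.source))


# Hypothetical data: names, coordinates and lengths are invented for illustration.
seq_lengths = {"scaffold_1": 5000}
features = [
    Feature("scaffold_1", 100, 900, "glimmer3", "gene_0001"),
    Feature("scaffold_1", 850, 1200, "blastx", "hit_A"),
    Feature("scaffold_1", 4800, 5200, "interproscan", "domain_X"),  # runs past the end
    Feature("scaffold_9", 10, 50, "glimmer3", "gene_0002"),         # unknown reference
]

valid, rejected = check_consistency(features, seq_lengths)
hits = query_region(valid, "scaffold_1", 800, 1000)
print([f.name for f in hits])
```

Because every result is reduced to a sequence identifier plus a start and end position, outputs from unrelated tools can be queried together with a single overlap test. In a database-backed setup the same test would presumably become a range condition on the stored coordinates.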