RESEARCHERS
AGÜERO Fernan Gonzalo
conferences and scientific meetings
Title:
Genomics and Bioinformatics in Trypanosoma cruzi: transcript reconstruction, SNP identification and database development
Author(s):
FERNÁN AGÜERO
Place:
Buenos Aires, Argentina
Meeting:
Other; Segundas Jornadas Iberoamericanas de Bioinformatica; 2006
Organizing institution:
CyTED
Abstract:
Fernán Agüero <fernan@unsam.edu.ar>, Universidad Nacional de San Martín, Argentina

Trypanosoma cruzi, the causal agent of Chagas' disease, is a eukaryotic protozoan parasite that is endemic to most of Central and South America. T. cruzi has unusual biological features, some of which are shared with other trypanosomatids. At the molecular level, transcription of genes is polycistronic and requires post-transcriptional processing to produce mature mRNAs. This processing includes the addition of a common 5' exon to all mRNAs by trans-splicing. At the cellular level, replication of the parasite is mostly asexual (clonal). Although exchange of genetic material between individuals has been observed, it is a rare event and does not seem to involve meiosis. Rather, the proposed mechanism of genetic exchange is the formation of hybrids by fusion, followed by recombination and resolution of aneuploidies.

The T. cruzi genome has been sequenced using a whole-genome shotgun strategy. Before the genome was finished, however, a number of gene discovery efforts had produced ~14,000 expressed sequence tags (ESTs). We have used these EST datasets for transcript reconstruction. As a first step, we curated the available information to separate the datasets according to their originating cDNA libraries. Sequences were then masked against a database of T. cruzi repetitive elements and grouped into clusters using either a genome-mapping approach or a similarity-based approach. In the first approach, partial transcript sequences are first mapped against the genome; sequences that map to the same genomic region are then grouped into the same cluster. In the second approach, partial transcript sequences are first compared against each other using an alignment-independent measure of their similarity.
Sequences are then grouped based on this 'distance' measure. In all cases, after the initial clustering, a multiple sequence alignment is obtained for each cluster. As a result of this work we obtained 3,790 reconstructed transcripts, which were mapped to the T. cruzi genome.

This work has been integrated into TcruziDB (http://tcruzidb.org), the Trypanosoma cruzi Genome Database. As part of this integration, it is now possible to query the T. cruzi genome for genes that have evidence of expression (at the transcriptional level), and to visually recognize the regions of the genome where ESTs have been mapped. Moreover, since the cDNA library information is preserved for each set of ESTs, it is now possible to query for genes that have evidence of transcription in a particular cell type or developmental stage.

Finally, using the multiple sequence alignments containing ESTs as a starting point, we have initiated an effort to map sequence variation (single nucleotide polymorphisms, SNPs) in T. cruzi. The strain chosen for the genome project, CL Brener, is a hybrid strain composed of two slightly divergent parental haplotypes. As a consequence, the final assembly of the genome contains many genomic regions that are represented by two consensus sequences, one per parental haplotype. We have taken advantage of this fact by adding genes, as annotated by the genome project, to our EST-based clusters and generating new multiple sequence alignments for these expanded clusters. As a first step we have identified SNPs in single-copy genes, to avoid dealing with potential assembly artifacts associated with large gene families. Thus, we have selected ~400 clusters for which we had one or two coding sequences obtained from the genome project aligned with EST data. We have scanned these alignments to look for columns containing sequence variation.
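The column scan described above can be illustrated with a minimal sketch. This is not the pipeline used in this work: the function name `variable_columns`, the `min_minor_count` parameter, the decision to ignore gaps and ambiguous bases, and the toy alignment are all assumptions made for illustration.

```python
from collections import Counter


def variable_columns(alignment, min_minor_count=1):
    """Return (column index, allele counts) for columns showing variation.

    `alignment` is a list of equal-length aligned sequences. Gaps ('-') and
    ambiguous bases such as 'N' are ignored (an assumption of this sketch).
    """
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")

    candidates = []
    for col in range(length):
        # Count only unambiguous bases observed in this column.
        counts = Counter(
            seq[col].upper() for seq in alignment if seq[col].upper() in "ACGT"
        )
        if len(counts) < 2:
            continue  # column is invariant (or empty)
        # Require the second most common allele to be seen often enough.
        minor_count = counts.most_common(2)[1][1]
        if minor_count >= min_minor_count:
            candidates.append((col, dict(counts)))
    return candidates


# Hypothetical toy cluster: two genome-derived sequences plus two ESTs.
toy_alignment = [
    "ATGGCCTTA",
    "ATGGCTTTA",
    "ATGGCCTT-",
    "--GGCTTTA",
]
print(variable_columns(toy_alignment))
```

A scan like this only nominates candidate sites; it does not weigh base quality or alignment depth, which is the role of the statistical evaluation of the candidates.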
These candidate polymorphic sites have been analyzed using a statistical package (PolyBayes) together with custom filters aimed at identifying regions containing a high density of polymorphic sites. As a result, we have identified 341,087 putative SNPs in these clusters. Of these, 159,517 (47%) fall into regions with a high density of polymorphic sites, which might correspond to regions of low sequence quality. We have also identified 5,338 SNPs (1.6%) with a probability of being a true SNP > 0.7 (as assigned by PolyBayes). We have also analyzed SNPs located within coding sequences in the context of the underlying translational reading frame, and identified SNPs causing non-synonymous mutations or premature stop codons. In this set of clusters we have identified 789, 599 and 8 SNPs that result in synonymous, non-synonymous and nonsense mutations, respectively.
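As an illustration of the reading-frame analysis, the sketch below classifies a single-base substitution in a coding sequence as synonymous, non-synonymous or nonsense. It is not the code used in this work: the function name `classify_snp`, the use of the standard genetic code, and the example sequence are assumptions, and the coding sequence is assumed to start in frame at position 0.

```python
from itertools import product

# Standard genetic code (assumed here), codons ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_snp(cds, position, alt_base):
    """Classify a substitution at 0-based `position` of an in-frame CDS."""
    offset = position % 3
    codon_start = position - offset
    ref_codon = cds[codon_start:codon_start + 3].upper()
    if len(ref_codon) != 3:
        raise ValueError("position falls in an incomplete codon")
    # Build the variant codon by swapping in the alternative base.
    alt_codon = ref_codon[:offset] + alt_base.upper() + ref_codon[offset + 1:]
    ref_aa = CODON_TABLE[ref_codon]
    alt_aa = CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        return "synonymous"
    if alt_aa == "*":
        return "nonsense"  # introduces a premature stop codon
    return "non-synonymous"


# Hypothetical coding sequence: ATG GCC TTA TGG
example_cds = "ATGGCCTTATGG"
print(classify_snp(example_cds, 5, "T"))  # GCC -> GCT
print(classify_snp(example_cds, 4, "T"))  # GCC -> GTC
print(classify_snp(example_cds, 7, "A"))  # TTA -> TAA
```

Each SNP is evaluated independently here; two polymorphic sites in the same codon would need to be considered together, which this sketch does not attempt.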