CIFASIS   20631
CENTRO INTERNACIONAL FRANCO ARGENTINO DE CIENCIAS DE LA INFORMACION Y DE SISTEMAS
Executing Unit - UE
Articles
Title:
Methionine Exploration of Trypanosomes Software Tool
Author(s):
P. BULACIO; L. ANGELONE; L. ESTEBAN; E. SERRA; E. TAPIA
Journal:
EMBnet.journal
Publisher:
EMBnet
References:
Year: 2009, vol. 16, pp. 44-45
ISSN:
1023-4144
Abstract:
Background: The automatic identification of Translation Initiation Sites (TISs) remains a challenging problem for gene prediction. Standard tools such as Glimmer, MED-Star and GSFinder look for open reading frames with a statistically significant minimal length, an approach that may work on prokaryotic sequences but not on eukaryotic ones (Gopal et al. Nucleic Acids Res. 2003, 31:5877–5885). A plausible explanation is that scores derived from a trained statistical model consider only coding-region information. These scores make sense only for highly compact bacterial (prokaryotic) genomes, which have a high frequency of coding sequences. In protozoa, however, coding regions frequently represent less than 10% of the genome (El-Sayed et al. Science 2005, 309:404–409), which yields misleading training sets. MET is a computational tool for TIS prediction in Trypanosomes. Its main goals are simplicity and accuracy of TIS prediction. The MET architecture, based on GSEA, consists of LOAD data, LEARN, PREDICT, ANALYSIS HISTORY, and REPORTS modules. The core process is performed by the PREDICT module, which implements a heuristic that requires a knowledge model to classify sequences into CODing and NO CODing. Such a model can be inferred using the LEARN module (AdaBoost DS). The final result is a ranking of potential TISs with their corresponding p-values. Methods: The PREDICT module implements classification and exploration tasks. The first step classifies the input sequence S into COD and NO COD subsegments (COD if Pi > 0.5, NO COD if Pi < 0.5). If COD subsegments exist, the process starts by parsing S, taking into account the first 10 ATGs: Si ← S(ATGi, ..., ATGn), n ≤ 10. Within this set of subsequences, potential TISs are found by a pruning process: potential TISs are those ATGs that precede two coding subsegments, with at most one gap allowed, i.e., COD-COD or COD-NO COD-COD.
Once potential TISs have been identified, a MET score based on the probabilities from the initial classification is associated with each subsequence: Mi ← Prod p(Si,j), with j = 1 to n. Finally, a statistical analysis (permutation tests) identifies the most reliable TISs. Results: Trypanosoma cruzi is analysed. T. cruzi sequencing projects (http://tritrypdb.org) search for DNA regions associated with Chagas disease, which involves TIS discovery for gene identification. TIS identification results obtained with MET on T. cruzi sequences are shown within an embedded browser consisting of a table with candidate TIS positions and p-values, and a graphical view of raw coding scores from the core AdaBoost classifier. The graphical MET output can be used for supplementary TIS inspection. Conclusion: The availability of user-friendly software is an important issue in current Bioinformatics research. TIS prediction with MET requires only a well-curated dataset of COD and NO COD sequences. As a result of its data-driven approach, MET may be well suited for TIS prediction in hard-to-analyze genomes like T. cruzi.
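
Illustrative sketch (not the MET implementation): the pruning and scoring steps described under Methods could look roughly like the following. Several points are assumptions made only for illustration and are not stated in the abstract: the function names are hypothetical; each of the first 10 ATGs is taken to open one subsegment whose coding probability has already been produced by some classifier; a probability of exactly 0.5 is treated as NO COD; and the MET score of candidate i is taken to be the product of the probabilities of all subsegments from i onward. The permutation test is omitted.

```python
from math import prod


def find_atgs(seq, limit=10):
    """Return 0-based positions of the first `limit` ATG triplets in seq."""
    positions = []
    start = seq.find("ATG")
    while start != -1 and len(positions) < limit:
        positions.append(start)
        start = seq.find("ATG", start + 1)
    return positions


def is_potential_tis(labels):
    """Apply the pruning rule to the labels (True = COD) that follow an ATG.

    A candidate is kept when the pattern is COD-COD or COD-NO COD-COD.
    """
    if len(labels) >= 2 and labels[0] and labels[1]:
        return True
    return bool(len(labels) >= 3 and labels[0] and not labels[1] and labels[2])


def rank_candidates(probs, threshold=0.5):
    """Rank candidate ATGs by an assumed MET score.

    probs[j] is the (assumed) coding probability of the subsegment opened
    by the j-th ATG. Returns (atg_index, score) pairs, best score first.
    """
    labels = [p > threshold for p in probs]  # assumption: exactly 0.5 is NO COD
    ranked = []
    for i in range(len(probs)):
        if is_potential_tis(labels[i:]):
            # Assumed reading of Mi <- Prod p(Si,j): product over subsegments i..n
            ranked.append((i, prod(probs[i:])))
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    print(find_atgs("CCATGAAATGTTATG"))
    print(rank_candidates([0.2, 0.9, 0.4, 0.8, 0.9]))
```

In this sketch the ranking is by raw product only; MET additionally assigns p-values through permutation tests, which would replace or complement this ordering.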