RESEARCHERS
SPETALE Flavio Ezequiel
Conferences and scientific meetings
Title:
Automatic annotation in GO based on Machine Learning in Tetrahymena thermophila
Author(s):
COSTA JOAQUIN; BRACALENTE FERNANDO; ARABOLAZA ANA LORENA; GRAMAJO HUGO; UTTARO ANTONIO; SPETALE FLAVIO EZEQUIEL
Location:
Quilmes
Meeting:
Conference; XI CAB2C; 2021
Abstract:
Background: Tetrahymena thermophila is a unicellular ciliate that combines the complexity of eukaryotic cellular processes with ease of genetic manipulation and cultivation. In particular, studying its sterol metabolism, which shifts from synthesizing the terpenoid alcohol "tetrahymanol" to assimilating and modifying dietary sterols when they are available, would help to understand cholesterol transport and homeostasis in higher eukaryotes. However, the mechanistic details of sterol uptake, intracellular transport and signaling systems in T. thermophila are still poorly understood. Gene Ontology (GO) terms associated with these functions are scarce in this microorganism, which makes it difficult to identify related proteins. Standard methods for annotating protein-coding genes based on sequence similarity, e.g., Blast2GO, do not give good results. Automatic functional annotation methods based on machine learning (ML) therefore arise as an alternative to standard methods. This work aims to fill the gap in the annotation of T. thermophila proteins, with emphasis on GO terms related to sterol metabolism.
Results: In this work, we predict the GO functions of T. thermophila proteins using FGGA, a novel graph-based ML package designed for the automatic annotation of proteins across the three GO subdomains. T. thermophila proteins with GO terms were collected from the UniProt database, and five GO terms involved in sterol metabolism were enriched with proteins from related organisms. The FGGA annotation algorithm assembles individual GO-term predictions issued by binary SVM classifiers. To train each individual SVM, a minimum of 50 positively annotated protein sequences was required. In addition, to assemble conveniently balanced training datasets, positively annotated protein sequences were complemented with negatively annotated counterparts selected using an inclusive separation policy.
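The training-set assembly described above (at least 50 positives per GO term, then a balanced set of negatives) can be illustrated with a minimal sketch. It is not FGGA's implementation: the function name, the input dictionary and the GO-term labels are hypothetical, and negatives are drawn at random from all non-positive proteins, which is a simple stand-in for the inclusive separation policy, not a reproduction of it.

```python
import random

MIN_POSITIVES = 50  # threshold stated in the abstract


def balanced_training_sets(annotations, min_positives=MIN_POSITIVES, seed=0):
    """Return {go_term: (positives, negatives)} for sufficiently annotated terms.

    annotations: dict mapping protein id -> set of GO terms (hypothetical input,
    assumed to be already propagated to ancestor terms).
    """
    rng = random.Random(seed)
    proteins = sorted(annotations)

    # Group proteins by the GO terms they carry.
    by_term = {}
    for protein in proteins:
        for term in annotations[protein]:
            by_term.setdefault(term, []).append(protein)

    datasets = {}
    for term, positives in by_term.items():
        if len(positives) < min_positives:
            continue  # too few positives to train a classifier for this term
        positive_set = set(positives)
        # Placeholder negative pool: every protein that is not a positive.
        pool = [p for p in proteins if p not in positive_set]
        k = min(len(positives), len(pool))
        negatives = rng.sample(pool, k)  # equal-sized sample keeps the set balanced
        datasets[term] = (positives, negatives)
    return datasets
```

Each resulting pair could then feed one binary classifier per GO term; a real separation policy would filter the negative pool using the GO hierarchy before sampling.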
To characterize each protein sequence in terms of a fixed number of input features, measurements of 89 Pfam domains were considered. FGGA predictions were evaluated on a test dataset comprising 20% of the data, using hierarchical Precision, Recall and F-score as performance metrics. For the set of 535 GO terms (BP-CC-MF) analyzed, we obtained hierarchical Precision, Recall and F-score values of 56%, 90% and 65%, respectively. The validation process included three proteins (Q22MT, I7M195 and A4VD37) that showed a significant change in expression in a previous RNA-Seq experiment in the presence of cholesterol. None of them has any reported annotation, and the algorithm was able to classify them into GO terms related to sterol metabolism.
Conclusions: The characterization of protein domain families has proven to be a valid approach for the classification of T. thermophila proteins.
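The hierarchical Precision, Recall and F-score mentioned in the Results can be sketched as follows, assuming the commonly used definition in which predicted and true GO-term sets are first extended with all their ancestors and the overlap is then pooled over proteins. The function names and the `parents` mapping are illustrative assumptions, and details such as whether the ontology root is counted may differ from the evaluation actually used in this work.

```python
def with_ancestors(terms, parents):
    """Return the given terms together with all of their ancestors."""
    closed = set()
    stack = list(terms)
    while stack:
        term = stack.pop()
        if term not in closed:
            closed.add(term)
            stack.extend(parents.get(term, ()))
    return closed


def hierarchical_prf(predicted, true, parents):
    """Compute hierarchical precision, recall and F-score.

    predicted, true: dicts mapping protein id -> set of GO terms.
    parents: dict mapping GO term -> iterable of parent terms (assumed input).
    """
    overlap = n_pred = n_true = 0
    for protein, true_terms in true.items():
        p = with_ancestors(predicted.get(protein, ()), parents)
        t = with_ancestors(true_terms, parents)
        overlap += len(p & t)
        n_pred += len(p)
        n_true += len(t)
    hp = overlap / n_pred if n_pred else 0.0
    hr = overlap / n_true if n_true else 0.0
    hf = 2 * hp * hr / (hp + hr) if hp + hr else 0.0
    return hp, hr, hf
```

Because ancestors are shared, a prediction that lands on a sibling of the correct term still earns partial credit, which is why these metrics suit annotation over a hierarchy such as GO.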