RESEARCHERS
SPETALE Flavio Ezequiel
Conferences and scientific meetings
Title:
Automatic GO prediction of proteins on SARS-CoV-2
Author(s):
ELIZABETH CHIACCHIERA; TAPIA ELIZABETH; SPETALE FLAVIO EZEQUIEL
Location:
Quilmes
Meeting:
Conference; XI CAB2C; 2021
Abstract:
Background: Gene Ontology (GO) is a structured repository of concepts (GO terms) comprising three sub-ontologies: biological process (BP), molecular function (MF), and cellular component (CC). Although gene products should ideally be annotated simultaneously over the three sub-ontologies, in-silico annotation methods, with only a few exceptions, work on individual sub-ontologies. Among the exceptions are annotation methods based on cross-ontology association rules and on interaction networks. However, the applicability of these methods is somewhat limited, since association rules can only be used with GO transitive relationships and interaction networks need huge amounts of curated data that are available only for model organisms.
Results: In this work, we predict the GO functionality of SARS-CoV-2 proteins using a novel graph-based Machine Learning package designed for the automatic annotation of protein-coding genes across the three GO subdomains. The package, called FGGA (https://bioconductor.org/packages/fgga), provides fully interpretable graphical annotations amenable to expert analysis. A set of 8574 SARS-CoV-2 protein sequences was collated from the UniProt database based on their GO terms and evidence codes. The FGGA annotation algorithm assembles individual GO-term predictions issued by binary SVM classifiers. For the training of individual SVMs, a minimum of 50 positively annotated protein sequences was required. In addition, to assemble conveniently balanced training datasets, positively annotated protein sequences were complemented with negatively annotated counterparts using an inclusive separation policy. To characterize individual protein sequences with a fixed number of input features, measurements of 478 physicochemical/secondary structure properties and sorting signals were considered. SARS-CoV-2 FGGA predictions were evaluated using 5-fold cross-validation and hierarchical Precision, Recall and F-score performance metrics. For the set of 349 GO terms (BP-CC-MF) analyzed, we obtained hierarchical Precision, Recall and F-score values of 90%, 93% and 91%, respectively.
Conclusions: The GO annotation of protein-coding genes in viruses by well-established sequence-similarity methods is a challenging task due to the fast-mutating nature of viruses. FGGA overcomes this critical limitation using the power of supervised Machine Learning methods. Preliminary results on SARS-CoV-2 proteins confirm the feasibility of our proposal.
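The hierarchical Precision, Recall and F-score mentioned in the Results can be illustrated with a minimal Python sketch. The abstract does not state the exact formulation used, so this assumes the common ancestor-closure definition: each protein's true and predicted GO-term sets are extended with all their ancestors in the ontology graph, and set overlaps are pooled over proteins. The function names, the toy graph and the term labels (A to D) are hypothetical and are not part of the FGGA package.

```python
from typing import Dict, Iterable, List, Set, Tuple


def ancestor_closure(terms: Iterable[str], parents: Dict[str, List[str]]) -> Set[str]:
    """Return the given terms together with all of their ancestors in the DAG."""
    closed: Set[str] = set()
    stack = list(terms)
    while stack:
        term = stack.pop()
        if term not in closed:
            closed.add(term)
            stack.extend(parents.get(term, ()))
    return closed


def hierarchical_prf(
    true_sets: List[Set[str]],
    pred_sets: List[Set[str]],
    parents: Dict[str, List[str]],
) -> Tuple[float, float, float]:
    """Hierarchical precision, recall and F-score pooled over all proteins.

    Assumption: root terms are counted like any other term; some
    formulations exclude them.
    """
    overlap = n_pred = n_true = 0
    for true_terms, pred_terms in zip(true_sets, pred_sets):
        t = ancestor_closure(true_terms, parents)
        p = ancestor_closure(pred_terms, parents)
        overlap += len(t & p)
        n_pred += len(p)
        n_true += len(t)
    hp = overlap / n_pred if n_pred else 0.0
    hr = overlap / n_true if n_true else 0.0
    hf = 2 * hp * hr / (hp + hr) if (hp + hr) else 0.0
    return hp, hr, hf


# Toy ontology (hypothetical labels): A is the root, B and C are children of A,
# and D is a child of B.
toy_parents = {"B": ["A"], "C": ["A"], "D": ["B"]}

# One protein annotated with D but predicted as C: only the shared
# ancestor A is credited.
hp, hr, hf = hierarchical_prf([{"D"}], [{"C"}], toy_parents)
print(f"hP={hp:.2f} hR={hr:.2f} hF={hf:.2f}")
```

In this toy case the true closure is {D, B, A} and the predicted closure is {C, A}, so a prediction in the wrong branch still earns partial credit for the shared ancestor, which is the motivation for hierarchical rather than flat metrics.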