FELLOWSHIPS
FENOY Luis Emilio
Conferences and scientific meetings
Title:
Deep computational prediction of protein annotations combining sequence + structural learned embeddings.
Author(s):
EMILIO FENOY; GEORGINA STEGMAYER
Location:
Corrientes, Argentina
Meeting:
Conference; XIICAB2C - 12th Argentinian Conference in Bioinformatics and Computational Biology; 2022
Organizing institution:
Asociación Argentina de Bioinformática y Biología Computacional
Abstract:
BACKGROUND
Improvements in experimental methods have generated rapid growth in the volume of protein data. In this scenario, automatic methods have become critical to aid curation-based annotation. Many computational approaches to predict protein activities, properties, interactions, structure and functions have been proposed in the last few years. As a result of the bioinformatics community's increasing interest in this topic, public benchmarks such as the Critical Assessment of Functional Annotation (CAFA) challenge were created, in which participants predict Gene Ontology (GO) terms for target proteins. However, the most recent results from the challenge show that the participating methods still perform poorly. We propose to improve prediction performance by using representation learning. Many protein representation learning methods are now available that calculate feature vectors for the samples in a dataset. Such vectors, also known as embeddings, define a relatively low-dimensional space that can efficiently encode high-dimensional data. Recently developed protein representation methods integrate different types of protein information in supervised or unsupervised approaches, and provide embeddings based on protein sequence or structure. In this work, we propose a novel method to predict protein annotations by combining sequence and structural information, encoded with state-of-the-art embeddings, in a deep learning model.

RESULTS
We used 9,500 Homo sapiens proteins from the CAFA3 challenge to test the most recent sequence embedding methods and selected the best performing one, ESM-1b, a Transformer-based model trained over the full UniProtKB database. To complement the sequence embedding with structural information, we predicted the tertiary structure of the proteins in the dataset using AlphaFold and encoded it with a novel structural embedding method called ESM-IF. This structure representation method was tested alone to predict GO annotations, obtaining the best performance reported so far. Preliminary results on the combination of this method with the sequence embedding indicate that using both together can improve on their individual performance.

CONCLUSIONS
The automatic annotation of proteins is a challenging task that requires new methods and strategies. We argue that combining sequence and structural information obtained from current state-of-the-art methods such as AlphaFold, ESM-1b, and ESM-IF can improve the accuracy of the Gene Ontology annotation of a protein.
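
ILLUSTRATIVE SKETCH
The abstract does not describe the model architecture, so the following is only a minimal sketch of one common way to combine two embeddings for multi-label GO prediction: mean-pool each per-residue embedding, concatenate the two vectors, and pass the result through a small feed-forward network with one sigmoid output per GO term. Everything here is an assumption made for illustration: the function names, the hidden size, the number of GO terms, the structural embedding width of 512, and the random arrays that stand in for real ESM-1b and ESM-IF outputs. The weights are untrained, so the scores carry no meaning.

```python
import numpy as np

# Hypothetical sizes: 1280 matches the ESM-1b hidden size; the others are
# placeholders chosen for this sketch, not values taken from the abstract.
SEQ_DIM, STRUCT_DIM, HIDDEN_DIM, N_GO_TERMS = 1280, 512, 64, 5


def mean_pool(residue_embeddings):
    """Collapse a (length, dim) per-residue matrix into one (dim,) vector."""
    return residue_embeddings.mean(axis=0)


def combine(seq_vec, struct_vec):
    """Concatenate the sequence and structure vectors into one feature vector."""
    return np.concatenate([seq_vec, struct_vec])


def predict_go_scores(features, w_hidden, b_hidden, w_out, b_out):
    """One ReLU hidden layer, then an independent sigmoid score per GO term."""
    hidden = np.maximum(0.0, features @ w_hidden + b_hidden)
    logits = hidden @ w_out + b_out
    return 1.0 / (1.0 + np.exp(-logits))


rng = np.random.default_rng(0)
protein_length = 120

# Random stand-ins for the per-residue embeddings of a single protein.
seq_residues = rng.normal(size=(protein_length, SEQ_DIM))
struct_residues = rng.normal(size=(protein_length, STRUCT_DIM))

features = combine(mean_pool(seq_residues), mean_pool(struct_residues))

# Untrained weights; a real model would learn these from annotated proteins.
w_hidden = rng.normal(scale=0.01, size=(SEQ_DIM + STRUCT_DIM, HIDDEN_DIM))
b_hidden = np.zeros(HIDDEN_DIM)
w_out = rng.normal(scale=0.01, size=(HIDDEN_DIM, N_GO_TERMS))
b_out = np.zeros(N_GO_TERMS)

scores = predict_go_scores(features, w_hidden, b_hidden, w_out, b_out)
print(features.shape, scores.shape)
```

Sigmoid outputs are used instead of a softmax because a protein can carry several GO terms at once, so each term is scored independently.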