GARCIA LABARI Ignacio
Conferences and scientific meetings
Title:
A music-inspired codification of ncRNAs with pseudoknots
Author(s):
GARCIA LABARI, IGNACIO; SPETALE, FLAVIO; TAPIA, ELIZABETH
Location:
Rosario
Meeting:
Conference; ICB2C Rosario 2023 - Congreso Internacional de Bioinformática y Biología Computacional; 2023
Abstract:
Non-coding RNA molecules (ncRNAs) perform vital regulatory and structural functions within cells. These functions are encoded in complex 2D structures, which makes their precise modeling and interpretation a formidable challenge. In previous work, we introduced a coding-based approach that integrates the primary sequence and secondary structure information of ncRNA molecules into sequences over a higher-order alphabet, suitable for widely used machine learning methods. However, representing ncRNA molecules in this way compromises their interpretability. We have extended our coding approach to model ncRNA molecules that include pseudoknots, intricate secondary structural motifs that exhibit base-pair interactions between distant regions of the RNA sequence. To facilitate the interpretation of the newly encoded sequences, which now use an abstract alphabet of 28 symbols, we adopt a music-inspired mapping that transforms them into sequences represented by seven musical notes and four distinct tones (piano notes). In this representation, ncRNA molecules can be analyzed and interpreted with well-established digital signal processing tools. To validate our approach, we perform decision tree classification on a dataset of nearly 4000 molecules belonging to 10 ncRNA families. These molecules range in length from 28 to 2968 nucleotides, and their linear and 2D structures, including pseudoknots, are experimentally known. Each sequence is transformed into a Fourier spectrum, which serves as an embedding layer that generates fixed-length feature vectors suitable for interpretable machine learning.
Beyond the 93% classification accuracy, our results confirm our intuitive expectations: short and simple molecules lacking pseudoknots are associated with low and medium notes, while long and complex molecules with pseudoknots are associated with high notes. Collapsing the 1D + 2D structure of ncRNAs into musical notes preserves both predictive power and interpretability. Our next step is to extend this methodology to predict the function of long non-coding RNAs with pseudoknots using Gene Ontology (GO) annotations.
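
The mapping and embedding steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the 28-symbol alphabet, which symbol maps to which note, or how the spectrum is turned into a fixed-length vector. The sketch assumes symbols are integers 0 to 27, assigns them to seven natural notes across four hypothetical octaves, uses MIDI pitch numbers as the numeric signal, and zero-pads the FFT to `n_fft = 4096` samples (enough for the longest reported molecule, 2968 nucleotides). All function names are illustrative.

```python
import numpy as np

# Seven natural notes and their semitone offsets within an octave.
NOTES = ["C", "D", "E", "F", "G", "A", "B"]
SEMITONES = [0, 2, 4, 5, 7, 9, 11]
# Assumption: the "four distinct tones" are modeled as four piano octaves.
OCTAVES = [2, 3, 4, 5]


def symbol_to_note(symbol: int) -> tuple[str, int]:
    """Map a symbol index (0..27) to a hypothetical (note, octave) pair."""
    if not 0 <= symbol < len(NOTES) * len(OCTAVES):
        raise ValueError(f"symbol must be in 0..27, got {symbol}")
    octave_idx, note_idx = divmod(symbol, len(NOTES))
    return NOTES[note_idx], OCTAVES[octave_idx]


def note_to_midi(note: str, octave: int) -> int:
    """Standard MIDI pitch number, e.g. C4 -> 60 and A4 -> 69."""
    return 12 * (octave + 1) + SEMITONES[NOTES.index(note)]


def spectrum_features(symbols, n_fft: int = 4096) -> np.ndarray:
    """Return a fixed-length magnitude spectrum for an encoded sequence.

    The pitch signal is mean-centered, zero-padded to n_fft samples
    (or cropped if longer), and transformed with a real-input FFT, so
    every sequence yields n_fft // 2 + 1 features.
    """
    if len(symbols) == 0:
        raise ValueError("empty sequence")
    pitches = np.array(
        [note_to_midi(*symbol_to_note(s)) for s in symbols], dtype=float
    )
    pitches -= pitches.mean()  # remove the constant (DC) component
    magnitude = np.abs(np.fft.rfft(pitches, n=n_fft))
    return magnitude / len(symbols)  # crude length normalization


if __name__ == "__main__":
    short = [0, 5, 12, 3, 7]                            # toy encoded molecule
    long_ = np.random.default_rng(0).integers(0, 28, 500)
    print(spectrum_features(short).shape, spectrum_features(long_).shape)
```

Because both toy sequences produce vectors of the same length, they could be stacked into a feature matrix and passed to a decision tree classifier such as scikit-learn's `DecisionTreeClassifier`; that choice of library is likewise an assumption.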