RESEARCHERS
KAMENETZKY Laura
Conferences and scientific meetings
Title:
Novel microRNA discovery from genome-wide data: a computational pipeline with unsupervised machine learning
Author(s):
STEGMAYER, G.; YONES, C.; KAMENETZKY, L.; MACCHIAROLI, N.; PÉREZ, M.; ROSENZVIT, M. C.; MILONE, D.
Meeting:
Conference; IV ISCB-LA A2B2C Bioinformatics Conference; 2016
Abstract:
Background: There are several challenges related to the computational prediction of novel microRNAs (miRNAs), especially from genome-wide data and in non-model organisms. First, many pre-processing steps must be applied to the raw data to cut it into sequences. These steps involve selecting and using a variety of software packages written in different programming languages, with many possible configurations and parameters that are most of the time unclear and very difficult for the end user to set. After that, each sequence must be analyzed one by one to classify it as a possible pre-miRNA candidate. The classical approach has been to train a binary supervised classifier with well-known pre-miRNAs (for example, extracted from miRBase) and to define the non-pre-miRNA class artificially, which is very difficult. Thus, a single, complete, and simple procedure for unsupervised pre-miRNA prediction from genome-wide data is of high interest today.

Results: We have developed an integrated pipeline (Fig. 1) of just 5 simple steps that starts from genome-wide data and can be applied to model and non-model organisms. First, an intelligent pre-processing step cuts the genome into overlapping windows of nucleotides (sequences) longer than the mean pre-miRNA length of the species under analysis (or of a phylogenetically related one, if well-known pre-miRNAs are not available). The secondary structure is predicted using RNAfold, and pre-miRNA properties are verified, such as folding into stem-loops with a fixed value of minimum free energy. Multi-loop segments are split and duplicated stem-loops are deleted. If available, known RNAs can be filtered out as well. The remaining sequences go through a feature extraction process that can calculate all features published to date.
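The windowing and stem-loop checks described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the `step` parameter, and the `max_mfe` value of -20.0 kcal/mol are hypothetical, and the abstract's "fixed value of minimum free energy" is read here as an upper threshold. The dot-bracket structure and the energy are assumed to have been obtained beforehand from a folding tool such as RNAfold; the sketch does not call it.

```python
from typing import Iterator, Tuple


def overlapping_windows(genome: str, window: int, step: int) -> Iterator[Tuple[int, str]]:
    """Yield (start, subsequence) for overlapping windows over the genome.

    `step` (hypothetical parameter) must be smaller than `window`,
    otherwise consecutive windows would not overlap.
    """
    if not 0 < step < window:
        raise ValueError("step must be positive and smaller than window")
    if not genome:
        return
    last_start = max(len(genome) - window, 0)
    starts = list(range(0, last_start + 1, step))
    if starts[-1] != last_start:
        starts.append(last_start)  # extra window so the tail is covered
    for start in starts:
        yield start, genome[start:start + window]


def is_single_stem_loop(structure: str, min_pairs: int = 1) -> bool:
    """Return True if a dot-bracket string describes a single hairpin.

    A single hairpin has every '(' before the first ')'; a second '('
    after a ')' indicates several stems (for example a multi-loop).
    """
    pairs = structure.count("(")
    if pairs == 0 or pairs < min_pairs or pairs != structure.count(")"):
        return False
    first_close = structure.find(")")
    return "(" not in structure[first_close:]


def passes_hairpin_filter(structure: str, mfe: float, max_mfe: float = -20.0) -> bool:
    """Keep single stem-loops whose free energy is at or below `max_mfe`.

    The default threshold is an illustrative assumption, not a value
    taken from the abstract.
    """
    return is_single_stem_loop(structure) and mfe <= max_mfe
```

For example, `list(overlapping_windows("ACGTACGTACG", window=4, step=2))` produces windows starting at positions 0, 2, 4, 6 and 7, the last one being added so that the end of the sequence is not dropped.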
The resulting feature vectors are used to train an unsupervised machine learning model named miRNA-SOM, a deep architecture of several nested SOMs (self-organizing maps) that requires only positive labelled examples (the well-known pre-miRNAs). It clusters all the unlabelled sequences together with the pre-miRNAs. miRNA-SOM allows for the quick identification of the best pre-miRNA candidates as those sequences clustered together with known precursors at the last level of the deep model.

We have performed a benchmarking test with the full Caenorhabditis elegans genome. Applying the pipeline yielded 1,739,124 sequences. From miRBase v17, 200 well-known pre-miRNAs of C. elegans were used as positive labelled samples. After training, the unsupervised prediction model was tested with the pre-miRNAs added more recently to miRBase (v18-21) and absent from v17. In this test, 44 out of 48 were identified as positive, resulting in a model sensitivity of 92%. The proposed pipeline was also applied to the genomes of the parasites Echinococcus multilocularis and Taenia solium, allowing the effective discovery of 11 and 7 novel pre-miRNAs, respectively. These novel candidates have since been validated with "wet" experiments and RNA-seq data.
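The candidate-selection rule and the reported sensitivity can be made concrete with a short Python sketch. It assumes that some clustering model has already assigned every sequence to a map unit; the `assignments` mapping and both function names are hypothetical, and the code does not reproduce the nested miRNA-SOM architecture itself. It only shows the final rule (unlabelled sequences sharing a unit with known precursors become candidates) and the arithmetic behind 44 out of 48.

```python
from typing import Dict, Hashable, Iterable, Set


def candidates_by_coclustering(
    assignments: Dict[str, Hashable], known: Iterable[str]
) -> Set[str]:
    """Return unlabelled sequence ids that share a unit with a known pre-miRNA.

    `assignments` maps each sequence id to the unit (cluster) it fell into;
    `known` lists the ids of the positive labelled examples.
    """
    known_ids = set(known)
    positive_units = {assignments[k] for k in known_ids if k in assignments}
    return {
        seq_id
        for seq_id, unit in assignments.items()
        if unit in positive_units and seq_id not in known_ids
    }


def sensitivity(true_positives: int, false_negatives: int) -> float:
    """Sensitivity (recall) = TP / (TP + FN)."""
    total = true_positives + false_negatives
    if total == 0:
        raise ValueError("no positive examples to evaluate")
    return true_positives / total


# Benchmark figures from the abstract: 44 of 48 held-out pre-miRNAs recovered.
benchmark = sensitivity(true_positives=44, false_negatives=48 - 44)
print(f"{benchmark:.1%}")  # 91.7%, reported as 92% after rounding
```

Selecting by shared unit keeps the rule independent of how the map was trained, so the same function would apply to whichever level of a nested model supplies the final assignments.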