SINC(I)   25518
INSTITUTO DE INVESTIGACION EN SEÑALES, SISTEMAS E INTELIGENCIA COMPUTACIONAL
Executing Unit - UE
Conferences and scientific meetings
Title:
Automatic extraction of hairpin sequences from genome-wide data
Author(s):
YONES, C.A.; MILONE, D.H.; STEGMAYER, G.; KAMENETZKY, L.
Meeting:
Conference; 4th ISCB-LA Bioinformatics Conference; 2016
Abstract:
Extracting stem-loop sequences from raw genome-wide data is an important step in several bioinformatics data mining tasks. For example, machine learning methods that discover new microRNA precursors (pre-miRNAs) in genome-wide data need stem-loop sequences as input to build a prediction model. Genome preprocessing strongly influences the prediction results: all well-known pre-miRNAs must be present in the resulting sequences, so all hairpins must be properly trimmed. Although some existing scripts can be adapted and combined to achieve this task, to the best of our knowledge they are outdated and do not take advantage of the latest advances in secondary structure prediction. For example, the mirCheck scripts for searching hairpins use Einverted, developed in 1999. We have developed a simple, integrated tool that automatically extracts and folds all hairpin sequences from raw genome-wide data. It predicts the secondary structure of several overlapping segments of the raw genome, each longer than the mean length of the well-known pre-miRNAs of the species being processed, ensuring that no pre-miRNA is inappropriately cut. Stem-loops that meet the specified requirements are extracted and trimmed. Each genome can be processed in parallel, and the implementation is memory efficient because it can automatically split large multi-FASTA files. Several genomes were processed and the results were compared with those from mirCheck. The numbers of hairpins and known pre-miRNAs found were significantly higher for the proposed method.
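The windowing and trimming steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that each window has already been folded by some secondary structure predictor that returns dot-bracket notation (the folding call itself is left out), and the function names, the `min_pairs` threshold, and the rule of stopping a stem at the first branch point are all hypothetical choices made for illustration.

```python
from typing import Iterator, List, Tuple


def overlapping_windows(seq: str, window: int, step: int) -> Iterator[Tuple[int, str]]:
    """Yield (start, segment) pairs; step < window makes the segments overlap."""
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    start = 0
    while True:
        yield start, seq[start:start + window]
        if start + window >= len(seq):
            break
        start += step


def pair_table(structure: str) -> List[int]:
    """Return pt where pt[i] is the partner of position i, or -1 if unpaired."""
    pt = [-1] * len(structure)
    stack: List[int] = []
    for i, ch in enumerate(structure):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            if not stack:
                raise ValueError("unbalanced structure")
            j = stack.pop()
            pt[i], pt[j] = j, i
    if stack:
        raise ValueError("unbalanced structure")
    return pt


def extract_hairpins(structure: str, min_pairs: int = 3) -> List[Tuple[int, int, int]]:
    """Return (start, end, pairs) for each stem-loop, with end exclusive.

    Each hairpin is grown outward from its terminal loop, tolerating bulges
    and internal loops, and is trimmed where the stem ends or branches.
    """
    pt = pair_table(structure)
    n = len(structure)
    hairpins = []
    for i in range(n):
        j = pt[i]
        # Innermost pair: everything between i and j is unpaired.
        if j > i and all(pt[k] == -1 for k in range(i + 1, j)):
            a, b, pairs = i, j, 1
            while True:
                p, q = a - 1, b + 1
                while p >= 0 and pt[p] == -1:   # skip unpaired bases on the left
                    p -= 1
                while q < n and pt[q] == -1:    # skip unpaired bases on the right
                    q += 1
                if p >= 0 and q < n and pt[p] == q:
                    a, b, pairs = p, q, pairs + 1
                else:
                    break
            if pairs >= min_pairs:
                hairpins.append((a, b + 1, pairs))
    return hairpins


if __name__ == "__main__":
    for start, segment in overlapping_windows("ACGUACGUAC", window=4, step=2):
        print(start, segment)
    print(extract_hairpins("..(((...)))..((((....)))).."))
```

Hairpin coordinates are relative to the window, so adding the window's start offset gives the position in the genome. Because windows overlap, the same hairpin can be reported from more than one window, and a real pipeline would need to deduplicate by genomic coordinates.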