RESEARCHERS
FARBER Marisa Diana
Conferences and scientific meetings
Title:
MLST-pipeline: a Galaxy-based workflow for the analysis of Multilocus Sequence Typing schemes
Author(s):
LEW SERGIO; GONZALEZ SERGIO; GUILLEMI E.; WILKOWSKY S.E.; FARBER MD
Location:
Paris
Meeting:
Conference; 10th International Meeting on Microbial Epidemiological Markers (IMMEM-10); 2013
Organizing institution:
Institut Pasteur
Abstract:
Multilocus sequence typing (MLST) is an unambiguous procedure for characterizing isolates of microbial species using the sequences of internal fragments of seven house-keeping genes. Internal fragments of approximately 450-500 bp of each gene are used, as these can be accurately sequenced on both strands using an automated DNA sequencer. For each house-keeping gene, the different sequences present within a species are assigned as distinct alleles and, for each isolate, the alleles at each of the seven loci define the allelic profile or sequence type (ST) (1). MLST has been successfully used to study population genetics and to reconstruct the micro-evolution of epidemic bacteria, fungi and protozoa. Highly reproducible data, together with the availability of low-cost sequencing services, make MLST a powerful tool. However, the manually intensive steps of processing raw sequence data files and of downstream analysis have hindered the application of the methodology. In recent years, several bioinformatic workflow systems have emerged in response to big data. Among them, Galaxy, an open web-based platform for data-intensive processing (http://galaxyproject.org/), is a powerful and flexible alternative for the analysis of genomics and post-genomics projects. Here, we introduce MLST-pipeline, a Galaxy-based tool for multilocus sequence typing data analysis. Dedicated tools for data uploading (raw chromatogram or FASTA files), base calling (PHRED), contig assembly (CAP3), sequence trimming and typing were wrapped so that they can be called from the Galaxy user interface. During trimming, a user-defined primer sequence is used to detect the sequence start site, while the end site is computed from predetermined sequence lengths. The ST wrapper searches for existing allelic variants in the local MySQL DB and generates an HTML report that includes the isolate haplotype and a FASTA-formatted concatenation of the uploaded alleles.
Sequences that do not match any variant already stored in the DB are reported as putative new alleles, pending manual curation as a quality control step before the DB is updated. Validation using bacterial (Anaplasma marginale) and protozoan (Babesia bovis and B. bigemina) datasets revealed complete agreement between the results generated by the manual and automated workflows.
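The trimming and typing steps described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the two toy loci and the in-memory dictionaries (standing in for the local MySQL DB) are hypothetical, and it assumes that the trimmed fragment begins immediately after the primer match, which the abstract does not specify.

```python
from typing import Dict, List, Optional, Tuple


def trim_locus(read: str, primer: str, length: int) -> Optional[str]:
    """Return the `length` bases that follow the first primer match.

    Assumption: the start site is the base right after the primer.
    Returns None if the primer is absent or the read is too short.
    """
    read = read.upper()
    start = read.find(primer.upper())
    if start == -1:
        return None
    start += len(primer)
    fragment = read[start:start + length]
    return fragment if len(fragment) == length else None


def type_isolate(
    fragments: Dict[str, str],
    alleles: Dict[str, Dict[str, int]],
    profiles: Dict[Tuple[int, ...], int],
    loci: Tuple[str, ...],
) -> dict:
    """Assign allele numbers per locus and resolve the sequence type (ST).

    `alleles` and `profiles` are hypothetical in-memory stand-ins for
    the allele and profile tables of a typing database.
    """
    numbers: List[Optional[int]] = []
    new_alleles: List[str] = []
    for locus in loci:
        number = alleles.get(locus, {}).get(fragments[locus])
        if number is None:
            # No stored variant matches: flag for manual curation.
            new_alleles.append(locus)
        numbers.append(number)
    st = None if new_alleles else profiles.get(tuple(numbers))
    return {
        "profile": tuple(numbers),
        "st": st,
        "new_alleles": new_alleles,
        "concatenated": "".join(fragments[locus] for locus in loci),
    }


# Toy data: two hypothetical loci with 6-base fragments
# (a real scheme uses seven loci of roughly 450-500 bp each).
LOCI = ("locusA", "locusB")
ALLELES = {"locusA": {"ACGTAC": 1, "ACGTTC": 2}, "locusB": {"GGCATT": 1}}
PROFILES = {(1, 1): 10}

fragments = {
    "locusA": trim_locus("TTGATCACGTACGG", "GATC", 6),
    "locusB": trim_locus("ACCAAGGCATTTT", "CCAA", 6),
}
result = type_isolate(fragments, ALLELES, PROFILES, LOCI)
print(result)
```

In this sketch an isolate receives an ST only when every locus matches a stored allele; any unmatched locus is listed under `new_alleles`, mirroring the curation step described in the abstract.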