ICIAGRO-LITORAL   28228
INSTITUTO DE CIENCIAS AGROPECUARIAS DEL LITORAL
Executing Unit (UE)
Conferences and scientific meetings
Title:
Development of a web-based platform for taxonomic classification of genomospecies belonging to the B. cereus group using machine learning
Author(s):
PETITTI TOMÁS; DAURELIO, LUCAS D.; TORRES MANNO, MARIANO A.; ESPARIZ, MARTÍN; CABRERA LORENZO
Location:
Buenos Aires
Meeting:
Conference; XI Congreso Argentino de Bioinformática y Biología Computacional; 2021
Organizing institution:
A2B2C
Abstract:
Background: The Bacillus cereus group is usually divided into three clades. Clade 1 includes pathogenic strains such as Bacillus anthracis. Clade 2 comprises Bacillus cereus sensu stricto and Bacillus thuringiensis; the former is associated with food poisoning, while the latter is used in agriculture for pest control. Clade 3 is the most phylogenetically diverse clade, and the strains that compose it have been isolated from very diverse sources. Classifying species within the B. cereus group has proven very challenging: multiple cases have been reported of incorrect classifications or of inconsistencies between taxonomic classification and genomic or phenotypic characteristics. Nevertheless, correct assignment is of great importance because these assignments are used to predict the performance and safety of bacteria, and thus affect their use for industrial or agronomic purposes.
Results: In this work, we use the machine learning algorithm Random Forest to generate classifiers based on gene markers reported for this group. First, 2460 sequences belonging to the Bacillus cereus group were downloaded from GenBank, of which 2117 had already been classified by us, while the remaining 343 had been recently uploaded to the database. They were filtered by quality using N50, genome size, and total number of contigs. In this way, 2191 sequences were obtained and validated or reassigned to species using Average Nucleotide Identity (ANI) and multi-locus sequence analysis (MLSA). This resulted in the reassignment of 47.13% of the sequences recently uploaded to the databases. In addition, 5 strains were classified as new genomospecies, named genomospecies 38, 39, 40, 41, and 42. To generate Random Forest-based classifiers, the sequences of 22 gene markers from each of the strains were divided into a training group and a testing group.
Of the 2191 sequences, 63 were not included in this analysis because they lacked some gene markers, suggesting that they were incomplete. Classifier models were generated from the training group, and their accuracy was evaluated by cross-validation. The classifiers generated to assign genomospecies to sequences belonging to any of the 3 clades had an accuracy higher than 98%, with those based on gyrB, pyc, or lon being the most accurate. The testing group was then used to estimate the error of the classifiers, which was less than 1% for Clades 1 and 2 and less than 4% for Clade 3. Finally, a web-based platform was built to make the scripts simpler and more user-friendly to use.
Conclusions: The results show that the generated classifiers achieve high accuracy, and the error obtained with the testing group indicates that the models are not overfitted. The web platform will make it easier for non-experts in the field to use the scripts, reaching a larger number of people. Thus, these classifiers will allow mass assignments in metagenomic analyses as well as assignments of new B. cereus isolates in a fast and accurate way.
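Illustrative note: the quality filter mentioned in the Results (N50, genome size, and total number of contigs) could be sketched as below. This is a minimal sketch, not the authors' script. The names `n50`, `QualityThresholds`, and `passes_quality` are hypothetical, and the threshold values are placeholders, since the abstract does not report the cut-offs that were used.

```python
from dataclasses import dataclass
from typing import List, Optional


def n50(contig_lengths: List[int]) -> int:
    """Return the N50 of an assembly.

    N50 is the contig length L such that contigs of length >= L
    together cover at least half of the total assembly length.
    """
    if not contig_lengths:
        raise ValueError("assembly has no contigs")
    half_total = sum(contig_lengths) / 2
    running = 0
    # Walk from the longest contig down until half the assembly is covered.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running >= half_total:
            return length
    raise AssertionError("unreachable for non-negative contig lengths")


@dataclass(frozen=True)
class QualityThresholds:
    # Placeholder cut-offs for illustration only (assumption):
    # the abstract does not state the values actually applied.
    min_n50: int = 20_000
    min_genome_size: int = 4_500_000
    max_genome_size: int = 7_000_000
    max_contigs: int = 500


def passes_quality(contig_lengths: List[int],
                   thresholds: Optional[QualityThresholds] = None) -> bool:
    """Apply the three criteria named in the abstract to one assembly."""
    t = thresholds or QualityThresholds()
    genome_size = sum(contig_lengths)
    return (
        n50(contig_lengths) >= t.min_n50
        and t.min_genome_size <= genome_size <= t.max_genome_size
        and len(contig_lengths) <= t.max_contigs
    )
```

In such a sketch, each downloaded assembly would be reduced to its list of contig lengths and kept only if `passes_quality` returns `True`. The thresholds would need to be set from the distribution of the actual data.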