PETITTI Tomás Denis
Conferences and scientific meetings
Title:
TAXONOMIC CLASSIFICATION OF 62 GENOMOSPECIES BELONGING TO THE BACILLUS CEREUS GROUP, USING A MACHINE LEARNING APPROACH.
Author(s):
PETITTI, TOMAS; TORRES MANNO, MARIANO ALBERTO; DAURELIO, LUCAS DAMIAN; ESPARIZ, MARTÍN
Meeting:
Conference; Congreso conjunto SAIB - SAMIGE 2021; 2021
Organizing institution:
SAIB jointly with the Asociación Civil de Microbiología General (SAMIGE)
Abstract:
The Bacillus cereus group is usually divided into three clades. Clade 1 includes pathogenic strains such as Bacillus anthracis. Clade 2 is composed of Bacillus cereus sensu stricto and Bacillus thuringiensis; the former is associated with food poisoning, while the latter is used in agriculture for pest control. Clade 3 is the most phylogenetically diverse clade, and its strains have been isolated from a wide range of sources. Species classification within the B. cereus group has proven very challenging, with multiple reported cases of misclassification or of inconsistencies between taxonomic classification and genomic or phenotypic characteristics. Nevertheless, correct assignment is of great importance because it is used to predict the performance and safety of bacteria, and therefore affects their use for industrial or agronomic purposes. Using the machine learning algorithm Random Forest, we evaluated gene markers for the classification of these genomospecies. To this end, we downloaded from GenBank 2460 sequences belonging to the three clades, of which 2117 had previously been classified by our team and 343 had recently been uploaded to the databases. All sequences were quality filtered, which eliminated 267 of them. Of the remaining 2191 sequences, 63 were excluded from the analysis because they lacked housekeeping genes. The species-level taxonomic identity of the study strains was validated or reassigned using Average Nucleotide Identity (ANI) and multilocus sequence analysis (MLSA). As a result, 47.13% of the sequences recently uploaded to the database were reassigned. In addition, 5 strains were classified as new genomospecies, named genomospecies 38, 39, 40, 41, and 42. Subsequently, to build the Random Forest-based classifier, the sequences of 22 gene markers for each of the strains in each clade were divided into a training group and a testing group.
From the training group, predictive classification models were generated; these showed accuracy values greater than 98% when assigning Clade 1, 2, and 3 species, with the classifiers based on the gyrB, pyc, or lon genes achieving the highest accuracy. Finally, the testing group was used to estimate the error of the classifiers, which was less than 1% for Clades 1 and 2 and less than 4% for Clade 3. These classifiers will therefore allow mass assignments in metagenomic analyses, as well as more precise assignment of new isolates of the B. cereus group.
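The train/test workflow described in the abstract can be illustrated with a minimal sketch. The abstract does not state how marker sequences were encoded, which software was used, or which hyperparameters were chosen, so everything below is an assumption made for illustration: the 3-mer count encoding, the use of scikit-learn, the 70/30 stratified split, and the synthetic sequences standing in for a single marker gene from three hypothetical genomospecies.

```python
import random
from itertools import product

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

K = 3  # k-mer length (assumed; not given in the abstract)
KMERS = ["".join(p) for p in product("ACGT", repeat=K)]
KMER_INDEX = {kmer: i for i, kmer in enumerate(KMERS)}


def kmer_counts(seq):
    """Turn a DNA sequence into a fixed-length vector of k-mer counts."""
    counts = [0] * len(KMERS)
    for i in range(len(seq) - K + 1):
        idx = KMER_INDEX.get(seq[i:i + K])
        if idx is not None:  # skip k-mers containing ambiguous bases
            counts[idx] += 1
    return counts


def mutate(seq, rate, rng):
    """Re-draw each base with probability `rate` (it may stay the same)."""
    return "".join(rng.choice("ACGT") if rng.random() < rate else b for b in seq)


rng = random.Random(0)

# Synthetic stand-ins for one marker gene in three hypothetical genomospecies.
ancestors = {}
for n in (1, 2, 3):
    ancestors[f"genomospecies_{n}"] = "".join(rng.choice("ACGT") for _ in range(300))

sequences, labels = [], []
for name, ancestor in ancestors.items():
    for _ in range(30):
        sequences.append(mutate(ancestor, 0.02, rng))
        labels.append(name)

X = [kmer_counts(s) for s in sequences]

# Stratified split so every genomospecies appears in both groups.
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.3, stratify=labels, random_state=0
)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Error on held-out sequences, analogous to the testing-group error.
test_error = 1.0 - model.score(X_test, y_test)
print(f"test error: {test_error:.3f}")
```

With real data, one such model would be trained per marker gene and per clade, and the held-out error of each model compared to identify the most informative markers.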