CONICET | Buscador de Institutos y Recursos Humanos

BACKGROUND
Due to the diversity of the B. subtilis clade, multiple cases of incorrect classification, or of inconsistency between taxonomic classification and genomic or phenotypic characteristics, have been reported. However, correct taxonomic assignment is critical, since classifications are used to estimate the safety and performance of bacteria, affecting their use in industry and agriculture.

RESULTS
In order to correct the taxonomic classification of genomes of the B. subtilis clade, 2625 sequences belonging to the clade were downloaded. Then, 133 low-quality sequences were eliminated. The taxonomic identity was validated or reassigned using Average Nucleotide Identity and multi-locus sequence analysis. Thus, 29.5% of the sequences were reassigned. In turn, 148 strains were classified into 12 new genomospecies.

The k-mers of the gyrA gene sequences were obtained. Classification models based on random forest (RF) and support vector machines (SVM) with linear and radial kernels were generated. The predictive capacity of these models was evaluated using repeated k-fold cross-validation (CV).

For the RF algorithm, a tuning grid of hyperparameters was evaluated, obtaining Kappa values between 0.9968 and 0.9984. On the testing set, Kappa values between 0.993 and 1.0 and AUC values of 1 were obtained, indicating that the models do not overfit.

For the SVM, the linear and radial kernels were evaluated with different hyperparameters. The best linear-kernel model (c = 20.5) showed an error of 0.1% on the training set; on the testing set, Kappa and AUC were 0.9969 and 0.9995, respectively, indicating that the model does not overfit. The best radial-kernel model showed an error of 17.7% on the training set; on the testing set, it showed a Kappa of 0.472 and an AUC of 0.82.

CONCLUSIONS
These results indicate that the data are linearly separable and that RF is the best algorithm for rapid and massive classification. Based on these results, a tool is proposed for classifying metagenomes within the B. subtilis clade, offering resolution similar to that of whole-genome analysis in less time.
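
As an illustration of the two quantities the results rely on, the following minimal sketch shows one way to turn a gene sequence into k-mer count features and to compute Cohen's Kappa from true and predicted labels. It is not the authors' pipeline: the abstract does not state the k-mer length or the software used, so the value k = 3 and the function names `kmer_counts`, `kmer_vector` and `cohen_kappa` are assumptions made for this example only.

```python
from collections import Counter
from itertools import product


def kmer_counts(sequence, k=3):
    """Count overlapping k-mers in a DNA sequence.

    k=3 is an assumed value for illustration; the abstract does not give one.
    Windows containing characters other than A, C, G, T are skipped.
    """
    seq = sequence.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):
            counts[kmer] += 1
    return counts


def kmer_vector(sequence, k=3):
    """Return counts ordered over all 4**k possible k-mers (fixed-length features)."""
    counts = kmer_counts(sequence, k)
    return [counts["".join(p)] for p in product("ACGT", repeat=k)]


def cohen_kappa(true_labels, predicted_labels):
    """Cohen's Kappa: observed agreement corrected for chance agreement."""
    n = len(true_labels)
    observed = sum(t == p for t, p in zip(true_labels, predicted_labels)) / n
    true_freq = Counter(true_labels)
    pred_freq = Counter(predicted_labels)
    expected = sum(true_freq[c] * pred_freq[c] for c in true_freq) / (n * n)
    if expected == 1.0:
        # Only one class present in both labelings; Kappa is undefined here.
        # Returning 1.0 is a convention chosen for this sketch.
        return 1.0
    return (observed - expected) / (1 - expected)
```

Under these assumptions, each sequence becomes a vector of length 4**k that could be passed to any RF or linear SVM implementation, and Kappa summarizes how far the predicted species labels agree with the reference labels beyond what chance alone would produce.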