FELLOWSHIPS
PETITTI Tomás Denis
Conferences and scientific meetings
Title:
TAXONOMIC CLASSIFICATION OF 2616 STRAINS BELONGING TO THE Bacillus subtilis GROUP
Author(s):
PETITTI, TOMAS; TORRES MANNO, MARIANO ALBERTO; DAURELIO, LUCAS DAMIAN; ESPARIZ, MARTÍN
Meeting:
Congress; XII Argentine Congress of Bioinformatics and Computational Biology; 2022
Organizing institution:
Asociación Argentina de Bioinformática y Biología Computacional
Abstract:
Background: Because of the diversity of the B. subtilis clade, multiple cases of incorrect classification, and of inconsistencies between taxonomic classification and genomic or phenotypic characteristics, have been reported. Proper assignment is critical, however, since taxonomic assignments are used to estimate the safety and performance of bacteria, which affects their use in industry and agriculture.

Results: To correct the taxonomic classification of genomes of the B. subtilis clade, 2625 sequences belonging to the clade were downloaded, and 133 low-quality sequences were then eliminated. The taxonomic identity was validated or reassigned using Average Nucleotide Identity and multi-locus sequence analysis. As a result, 29.5% of the sequences were reassigned, and 148 strains were classified into 12 new genomospecies. The k-mers of the gyrA gene sequences were obtained, and classification models based on random forest (RF) and support vector machines (SVM) with linear and radial kernels were generated. The predictive capacity of these models was evaluated using repeated k-fold cross-validation. For the RF algorithm, a tuning grid of hyperparameters was evaluated, yielding Kappa values between 0.9968 and 0.9984. On the testing set, Kappa values between 0.993 and 1.0 and AUC values of 1 were obtained, indicating that the models do not overfit. The linear and radial SVM kernels were evaluated with different hyperparameters. The best linear-kernel model (C = 2^0.5) showed an error of 0.1% on the training set; on the testing set, Kappa and AUC were 0.9969 and 0.9995, respectively, indicating that the model does not overfit. The best radial-kernel model showed an error of 17.7% on the training set, and a Kappa of 0.472 and an AUC of 0.82 on the testing set.

Conclusions: These results indicate that the data are linearly separable and that RF is the best algorithm for rapid and massive classification. Based on these results, we propose the development of a tool that allows the classification of metagenomes within the B. subtilis clade with a resolution similar to that of whole-genome analysis in less time.
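
The abstract relies on two building blocks: k-mer features derived from gyrA sequences and the Kappa statistic used to score the classifiers. The sketch below illustrates both using only the Python standard library. It is not the authors' pipeline: the abstract does not state the k-mer length, the software used, or how ambiguous bases were handled, so the function names, the default k = 4, and the choice to skip windows containing non-ACGT characters are all assumptions made for illustration.

```python
from collections import Counter
from itertools import product


def kmer_counts(sequence, k=4):
    """Count overlapping k-mers in a DNA sequence.

    Assumption: windows containing anything other than A, C, G, T are skipped.
    """
    sequence = sequence.upper()
    counts = Counter()
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        if set(kmer) <= set("ACGT"):
            counts[kmer] += 1
    return counts


def kmer_vector(sequence, k=4):
    """Fixed-length feature vector: one count per possible k-mer (4**k entries)."""
    counts = kmer_counts(sequence, k)
    return [counts["".join(p)] for p in product("ACGT", repeat=k)]


def cohen_kappa(true_labels, predicted_labels):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(true_labels)
    if n == 0 or n != len(predicted_labels):
        raise ValueError("label lists must be non-empty and of equal length")
    observed = sum(t == p for t, p in zip(true_labels, predicted_labels)) / n
    true_freq = Counter(true_labels)
    pred_freq = Counter(predicted_labels)
    # Chance agreement: sum over classes of the product of marginal frequencies.
    expected = sum(true_freq[c] * pred_freq[c] for c in true_freq) / (n * n)
    if expected == 1.0:
        return float("nan")  # kappa is undefined when only one class occurs
    return (observed - expected) / (1 - expected)


# Toy usage with made-up data (not real gyrA sequences or species labels).
features = kmer_vector("ATGGCA", k=2)   # 16 features, one per dinucleotide
kappa = cohen_kappa(["a", "a", "b", "b"], ["a", "a", "b", "a"])
print(len(features), kappa)             # 16 0.5
```

Vectors of this kind, one per strain, could then be passed to any RF or SVM implementation. A Kappa near 1, as reported for the RF and linear-kernel models, means the predicted species agree with the reference labels far more often than chance alone would explain.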