FELLOWSHIPS
PETITTI Tomás Denis
Conferences and scientific meetings
Title:
TAXONOMIC GENE MARKER BENCHMARKING BY USING A MACHINE LEARNING APPROACH
Author(s):
PETITTI, TOMAS; TORRES MANNO, MARIANO ALBERTO; MAGNI, CHRISTIAN; ESPARIZ, MARTÍN
Meeting:
Conference; Congreso conjunto SAIB - SAMIGE 2020 (joint SAIB - SAMIGE congress); 2020
Organizing institution:
SAIB jointly with the Asociación Civil de Microbiología General (SAMIGE)
Abstract:
Clade 2 of the Bacillus cereus group comprises the B. thuringiensis and B. cereus sensu stricto genomospecies. The former has important agronomic applications, whereas the latter is usually associated with food poisoning. Here, gene markers used to identify these genomospecies were evaluated using Random Forest, a machine learning technique based on decision trees. First, 2,459 available genomes of the B. cereus group were downloaded from GenBank. To select B. cereus sensu stricto and B. thuringiensis genomes, their average nucleotide identities (ANI) were computed against the type strains B. thuringiensis ATCC 10792 and B. cereus sensu stricto ATCC 14579. Of these genomes, 1,253 were selected for further study because they shared an ANI greater than the species threshold of 96% with the Clade 2 type strains. Low-quality genomes were then excluded using minimum quality thresholds based on deviation (mean ± 2 sd) from the expected genome size, the number of contigs (n), and the N50 of the assembly: genomes with n > 616, N50 < 28,036 bp, size < 4,940,889 bp, or size > 6,536,009 bp were classified as low quality and excluded. To verify their genomospecies assignments, the resulting 863 sequences were further analyzed using a phylogenetic approach based on 104 common ancestral genes present in all genomes under analysis, including the outgroups B. anthracis Ames and B. mycoides ATCC 6462. Then, a training group of 697 strains was used to train a forest of 10,000 trees and build a genomospecies classifier using the gene distances of 22 taxonomic gene markers as variables (22 × 697 = 15,334 variables). DNA gyrase subunit A (gyrA), pyruvate carboxylase (pyc), and DNA topoisomerase (gyrB) were the three most important markers for the classifier. One-gene classifiers with forests of 1,000 trees were then built using the gene distances of gyrA, pyc, or gyrB (697 variables each).
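The assembly-quality filter described above can be sketched as follows. This is only a minimal illustration, assuming each assembly is summarized by its contig count, N50, and total size; the `Assembly` class, the function names, and the example records are hypothetical and do not come from the study. Only the four cutoffs are taken from the abstract.

```python
from dataclasses import dataclass

# Cutoffs quoted in the abstract (derived there as mean ± 2 sd).
MAX_CONTIGS = 616
MIN_N50_BP = 28_036
MIN_SIZE_BP = 4_940_889
MAX_SIZE_BP = 6_536_009


@dataclass
class Assembly:
    """Hypothetical summary record for one genome assembly."""
    name: str
    n_contigs: int
    n50_bp: int
    size_bp: int


def is_low_quality(a: Assembly) -> bool:
    """Return True if any of the four exclusion criteria is met."""
    return (
        a.n_contigs > MAX_CONTIGS
        or a.n50_bp < MIN_N50_BP
        or a.size_bp < MIN_SIZE_BP
        or a.size_bp > MAX_SIZE_BP
    )


def filter_assemblies(assemblies):
    """Keep only the assemblies that pass every quality criterion."""
    return [a for a in assemblies if not is_low_quality(a)]


# Made-up records, for illustration only.
examples = [
    Assembly("asm_ok", 50, 400_000, 5_400_000),
    Assembly("asm_fragmented", 900, 15_000, 5_300_000),
    Assembly("asm_too_small", 40, 500_000, 4_100_000),
]
kept = filter_assemblies(examples)
print([a.name for a in kept])
```

The comparisons are strict, as in the abstract, so an assembly sitting exactly on a cutoff is kept.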
Notably, cross-validation of these classifiers gave an error rate of zero and a kappa of one in all cases. Then, highly correlated variables (correlation cutoff of 0.999) were pruned during data preprocessing, reducing the number of variables from 697 to 7, 17, and 8 for the gyrA, pyc, and gyrB classifiers, respectively. Finally, error rates were computed for the classifiers using the testing group (158 strains). No misclassifications were observed, indicating that the classifiers are accurate and unbiased. Our pipeline could be used to select suitable taxonomic markers for large-scale assignment of genomospecies identities at high resolution in comparative genomic or metagenomic studies.
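Two steps in this part of the abstract, pruning correlated variables and scoring a classifier with accuracy and kappa, can be illustrated with a short sketch. The abstract does not say which software or pruning procedure was used, so the greedy filter below is an assumption chosen for simplicity, and all function names and toy data are hypothetical. The kappa shown is Cohen's kappa.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def prune_correlated(columns, cutoff=0.999):
    """Greedy filter: keep a variable only if its absolute correlation with
    every variable already kept does not exceed the cutoff.
    Returns the indices of the kept columns."""
    kept = []
    for j, col in enumerate(columns):
        if all(abs(pearson(col, columns[k])) <= cutoff for k in kept):
            kept.append(j)
    return kept


def accuracy_and_kappa(truth, predicted):
    """Observed accuracy and Cohen's kappa for two label lists."""
    n = len(truth)
    labels = set(truth) | set(predicted)
    p_o = sum(t == p for t, p in zip(truth, predicted)) / n
    # Agreement expected by chance from the marginal label frequencies.
    p_e = sum((truth.count(l) / n) * (predicted.count(l) / n) for l in labels)
    if p_e == 1:
        return p_o, float("nan")  # kappa is undefined with a single label
    return p_o, (p_o - p_e) / (1 - p_e)


# Toy distance variables: column 1 is an exact multiple of column 0.
columns = [
    [0.0, 0.1, 0.2, 0.3],
    [0.0, 0.2, 0.4, 0.6],
    [0.3, 0.0, 0.2, 0.1],
]
kept_idx = prune_correlated(columns)

# Toy labels: a perfect prediction and one with a single error.
truth = ["cereus", "cereus", "thuringiensis", "thuringiensis"]
perfect = accuracy_and_kappa(truth, list(truth))
one_error = accuracy_and_kappa(
    truth, ["cereus", "thuringiensis", "thuringiensis", "thuringiensis"]
)
print(kept_idx, perfect, one_error)
```

Because kappa equals one only when predictions agree with the true labels on every sample, a kappa of one always goes together with an error rate of zero.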