RESEARCHERS
BRUNO Cecilia Ines
Conferences and scientific meetings
Title:
Validation indices for Bayesian cluster analysis in population structure genetics with high-dimensional data
Author(s):
VIDELA, M.E.; BRUNO C
Place:
Cordoba
Meeting:
Symposium; 1st Plant Breeding Symposium; 2021
Organizing institution:
INTA
Abstract:
Introduction: Clustering validation evaluates clustering results and is considered a complement to cluster analysis. Several indices have been proposed to determine the optimal number of groups. However, there is no consensus about which are the best. The objective of the study was to compare the performance of internal validation indices for identifying the optimal number of clusters that distinguishes population genetic structure (PGS) after applying a Bayesian clustering method.

Materials and methods: We conducted a simulation study with SNP molecular markers to illustrate the maize genetic structure. We designed nine biological scenarios by combining three levels of genetic differentiation with three numbers of population groups to achieve PGS. The levels of genetic divergence were low (L) with Fst=0.03, medium (M) with Fst=0.05, and high (H) with Fst=0.07. The numbers of population groups (k) were k=2, k=5, and k=10. We obtained 900 datasets of 80 k SNPs, each containing 1000 individuals. The simulated datasets were obtained using the package “Xbreed” in R (Esfandyari and Sørensen 2019). We compared the Calinski-Harabasz (CH), Connectivity, Dunn, and Silhouette indices for the validation of big data clusters obtained by the Bayesian method implemented in the package “LEA” in R. Each validation index has its own optimization criterion, from which a given number of clusters is proposed. We compared the number of groups suggested by each validation index with the number of groups set in the simulation. We counted the proportion of times an index suggested an incorrect number of groups as a classification error rate (type III error, E III). The classification error might occur because the number of estimated groups is either higher or lower than the simulated one, so we discriminated between overestimation (E III+) and underestimation (E III-) of the number of groups.

Results: There was no overestimation for any of the validation indices at any of the genetic divergence levels with k=2 populations (E III+=0%). In scenarios with k=5 populations, the Silhouette and Dunn indices had no misclassification. The CH index, however, matched the true number of groups 87-97% of the time. The Connectivity index underestimated the number of groups 94-96% of the time, and indicated two groups in 52-55% of the replicates. The Silhouette and Dunn indices behaved similarly with k=10 populations, both without misclassification. The CH index underestimated the number of groups in 12-17% of the replicates, whereas the Connectivity index did so in 100%. The Dunn and Silhouette indices had zero overestimation and underestimation type III errors.

Conclusions: Our results suggest that, when the number of population groups is high, the Dunn and Silhouette indices perform best in combination with the Bayesian clustering algorithm.
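
The type III error tally described in the methods can be illustrated with a minimal Python sketch. It is not the authors' code (the study was run in R), and the function name `type_iii_error_rates` and the example replicate values are hypothetical, chosen only to show how overestimation (E III+) and underestimation (E III-) rates could be counted against the simulated number of groups.

```python
from collections import Counter


def type_iii_error_rates(suggested_k, true_k):
    """Return E III+, E III- and total E III as percentages of replicates.

    suggested_k: number of groups proposed by a validation index, one per replicate.
    true_k: number of groups set in the simulation.
    """
    if not suggested_k:
        raise ValueError("need at least one replicate")
    # Classify each replicate as over-, under-, or correctly estimated.
    counts = Counter(
        "over" if k > true_k else "under" if k < true_k else "correct"
        for k in suggested_k
    )
    n = len(suggested_k)
    over = 100.0 * counts["over"] / n
    under = 100.0 * counts["under"] / n
    return {"E III+": over, "E III-": under, "E III": over + under}


# Hypothetical replicates for a scenario simulated with k=5 (illustrative values only).
suggested = [5, 5, 4, 5, 2, 5, 6, 5, 5, 5]
rates = type_iii_error_rates(suggested, true_k=5)
print(rates)
```

In this made-up example, one replicate overestimates and two underestimate the simulated k=5, so the function reports the two error directions separately as well as their sum.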