RESEARCHERS
BALZARINI Monica Graciela
conferences and scientific meetings
Title:
Comparison of algorithms to detect clusters of multivariate genetic data
Author(s):
BRUNO, C.; RUEDA CALDERÓN, A.; PEÑA MALAVERA, A.; BALZARINI, M.
Location:
Victoria
Meeting:
Conference; XXVIIIth International Biometric Conference; 2016
Abstract:
The aim of this work is to compare classification methods for detecting genetic structure in populations or collections of genotypes characterized with multiple genetic markers. Unsupervised clustering methods from the neural network family, such as the Self-Organizing Map (SOM), were used; these methods are designed to detect patterns or classify multidimensional observations when the underlying patterns are unknown a priori. The number of clusters suggested by SOM was determined via a scree plot. The performance of SOM in the search for genetic structure was compared with that of UPGMA hierarchical clustering and non-hierarchical clustering (K-means). For these clustering methods, the number of clusters was determined using 19 indices (R package NbClust) and the majority rule. Internal validation of the classifications was performed for all algorithms (R package clValid). In addition, K-means combined with AMOVA, as implemented in the AMOVA-based clustering software, was used; this tool uses the AMOVA sums of squares for clustering, based on the Euclidean distance matrix between molecular marker profiles. The procedure is improved by applying simulated annealing via Markov chain Monte Carlo to keep the clustering from becoming stuck at a local optimum. With this procedure, three statistics were estimated to determine the number of clusters: the pseudo-F statistic, the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC). The methods were evaluated using simulated datasets with three and five populations and three levels of genetic divergence (FST=0.06-0.07, FST=0.17-0.23 and FST=0.38), using co-dominant molecular markers. At the highest genetic divergence levels, all the classification methods detected the three simulated populations. For low FST values, none of the methods identified the number of underlying clusters.
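As an illustration of how a pseudo-F statistic can point to a number of clusters, the following is a minimal Python sketch, not the procedure used in the study. It assumes plain K-means on Euclidean distances and the Calinski-Harabasz form of the pseudo-F, (B/(k-1))/(W/(n-k)), where B and W are the between- and within-cluster sums of squares. It does not reproduce the AMOVA-based clustering software, the simulated annealing step, NbClust or clValid. The function names (`kmeans`, `pseudo_f`, `choose_k`), the farthest-first initialisation and the toy profiles are all hypothetical choices made for this example; the toy data are not genetic data.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(rows):
    """Coordinate-wise mean of a non-empty list of profiles."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def kmeans(data, k, restarts=10, max_iter=100, seed=0):
    """Lloyd's K-means; returns the labelling with the smallest within-cluster SS."""
    rng = random.Random(seed)
    best_labels, best_w = None, float("inf")
    for _ in range(restarts):
        # Farthest-first initialisation (an assumption of this sketch).
        centers = [list(rng.choice(data))]
        while len(centers) < k:
            far = max(data, key=lambda p: min(sq_dist(p, c) for c in centers))
            centers.append(list(far))
        labels = None
        for _ in range(max_iter):
            new = [min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in data]
            if new == labels:  # assignments stable: converged
                break
            labels = new
            for j in range(k):
                members = [p for p, lab in zip(data, labels) if lab == j]
                if members:  # keep the old centre if a cluster empties
                    centers[j] = centroid(members)
        w = sum(sq_dist(p, centers[lab]) for p, lab in zip(data, labels))
        if w < best_w:
            best_labels, best_w = labels, w
    return best_labels


def pseudo_f(data, labels):
    """Calinski-Harabasz pseudo-F: (B / (k - 1)) / (W / (n - k))."""
    groups = {}
    for p, lab in zip(data, labels):
        groups.setdefault(lab, []).append(p)
    n, k = len(data), len(groups)
    if k < 2 or k >= n:
        raise ValueError("pseudo-F needs 2 <= k < n non-empty clusters")
    grand = centroid(data)
    within = between = 0.0
    for members in groups.values():
        c = centroid(members)
        within += sum(sq_dist(p, c) for p in members)
        between += len(members) * sq_dist(c, grand)
    if within == 0:
        return float("inf")
    return (between / (k - 1)) / (within / (n - k))


def choose_k(data, k_values=range(2, 7)):
    """Return the k with the largest pseudo-F, plus all scores."""
    scores = {k: pseudo_f(data, kmeans(data, k)) for k in k_values}
    return max(scores, key=scores.get), scores


# Toy profiles: three tight, well-separated groups of four points each.
toy = [(x + dx, y + dy)
       for (x, y) in [(0, 0), (10, 10), (0, 10)]
       for dx in (0, 1) for dy in (0, 1)]

best_k, scores = choose_k(toy)
print("k with the largest pseudo-F:", best_k)
```

With well-separated groups such as these, the pseudo-F peaks sharply at the true number of groups. When groups overlap, as would be expected at low FST, the between-cluster sum of squares is small relative to the within-cluster sum of squares for every k, so the peak flattens and the criterion becomes less informative.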