RESEARCHERS
BRUNO Cecilia Ines
Conferences and scientific meetings
Title:
Clustering of massive genomic data
Author(s):
VIDELA, M.E.; IGLESIAS, J.; BALZARINI, M.; BRUNO, C.
Location:
Barcelona
Meeting:
Conference; XXIXth International Biometric Conference; 2018
Abstract:
Current genomic technologies make it possible to generate large volumes of data, with thousands of variables characterizing a single biological unit. The challenges involve genomic data coding, preprocessing and analytics. Identifying population genetic structure from genomic data is crucial for breeding and conservation. Several clustering algorithms are available for grouping genotypes from genomic data. With massive genetic data, the computational cost of clustering and the validation of the resulting clusters become important issues. In this work, eight methods for identifying clusters of maize genotypes from largely unlinked molecular marker data were compared using experimental Single-Nucleotide Polymorphism (SNP) marker data encoded for analysis in three different ways. Each dataset contains more than 50K SNP markers for 300 to 500 stabilized maize lines. We assessed the relative performance of the following clustering methods: Divisive Clustering Analysis (DIANA), Partitioning Around Medoids (PAM), Agglomerative Nesting (AGNES), Unweighted Pair Group Method with Arithmetic Mean (UPGMA), K-means, Kernel K-means, Fuzzy K-means and Ward's method. Genomic data were coded in three ways: binary (0 represents homozygous alleles and 1 a mutation), allelic frequency (0 represents homozygous alleles and mutations are encoded with the frequency of changes) and categorical (R, M, Y, K, S and W represent changes of the type A/G, A/C, T/C, T/G, C/G and A/T, respectively). The distance measure was chosen according to the data coding. Each algorithm's clustering was validated using the Dunn, connectivity and silhouette indices.
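To make the binary and categorical codings and one of the validation indices concrete, the following is a minimal Python sketch, not the authors' pipeline. It rests on several assumptions that the abstract does not state: genotypes are given as two-letter allele pairs, a "mutation" is read as a pair of differing alleles, homozygous calls keep their nucleotide letter in the categorical coding, and the distance is the proportion of mismatching markers. The function names (`encode_binary`, `encode_categorical`, `mismatch_distance`, `dunn_index`) and the toy genotypes are hypothetical. The Dunn index is computed in its usual form: the smallest between-cluster distance divided by the largest within-cluster distance.

```python
from itertools import combinations

# Letters for the six two-allele changes listed in the abstract.
IUPAC = {
    frozenset("AG"): "R",
    frozenset("AC"): "M",
    frozenset("TC"): "Y",
    frozenset("TG"): "K",
    frozenset("CG"): "S",
    frozenset("AT"): "W",
}


def encode_binary(genotype):
    """Return 0 when both alleles match, 1 otherwise (assumed reading of 'mutation')."""
    a, b = genotype
    return 0 if a == b else 1


def encode_categorical(genotype):
    """Keep the nucleotide for matching alleles (assumption), else the change letter."""
    a, b = genotype
    return a if a == b else IUPAC[frozenset((a, b))]


def mismatch_distance(x, y):
    """Proportion of markers at which two coded profiles differ."""
    return sum(u != v for u, v in zip(x, y)) / len(x)


def dunn_index(profiles, labels, dist=mismatch_distance):
    """Smallest between-cluster distance over largest within-cluster distance.

    Assumes at least two clusters and at least one cluster whose members
    are not all identical; otherwise the ratio is undefined.
    """
    between, within = [], []
    for i, j in combinations(range(len(profiles)), 2):
        d = dist(profiles[i], profiles[j])
        (within if labels[i] == labels[j] else between).append(d)
    return min(between) / max(within)


# Toy data: four hypothetical lines scored at four markers.
lines = [
    ["AA", "AG", "CC", "TT"],
    ["AA", "AG", "CC", "TC"],
    ["AG", "GG", "CG", "TT"],
    ["AG", "GG", "CG", "AT"],
]
labels = [0, 0, 1, 1]  # a candidate two-cluster partition

binary = [[encode_binary(g) for g in line] for line in lines]
categorical = [[encode_categorical(g) for g in line] for line in lines]

print(binary[1], categorical[1])
print(dunn_index(binary, labels), dunn_index(categorical, labels))
```

Higher Dunn values indicate compact, well-separated clusters, so the same function could be used to compare partitions produced by different algorithms on the same coded data. At the scale described in the abstract (more than 50K markers for hundreds of lines), a vectorized distance computation would be preferable to this pairwise loop.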