RESEARCHERS
TALEVI Alan
Conferences and scientific meetings
Title:
iRaPCA: A NOVEL METHOD FOR CLUSTERING OF SMALL MOLECULES.
Author(s):
ALBERCA, L.N.; PRADA GORI, DENIS; BELLERA, C.L.; TALEVI, A.
Meeting:
Conference; XI Congreso Argentino de Bioinformática y Biología Computacional 2021; 2021
Organizing institution:
Asociación Argentina de Bioinformática y Biología Computacional
Abstract:
Background: Clustering of molecules organizes a set of molecules into smaller subgroups (clusters) with similar features. Typical applications include the representative splitting of datasets for QSAR modeling and the selection of representative hits from in silico screening campaigns for acquisition and experimental confirmation. In this work, we present an in-house hierarchical representative sampling procedure for clustering small molecules. The approach, which we call iRaPCA, is based on an iterative combination of the random subspace approach (feature bagging), Principal Component Analysis (PCA) and the k-means algorithm. The method has been implemented as a web app so that any user can upload their SMILES and run the clustering with their own parameters.

Results: A new online tool for the clustering of molecules has been developed. We tested the tool on 29 datasets containing between 100 and 5000 small molecules and, as a benchmarking exercise, compared the results with the clusters from three other well-known clustering methods (Ward, Complete and Butina). Internal validation was performed in all cases. The mean silhouette score obtained by our method across the 29 datasets was 0.9045, whereas the Ward, Complete and Butina methods yielded 0.43, 0.42 and 0.27, respectively. Regarding the number of clusters and the percentage of outliers (atypical molecules), for the optimal clustering as judged by the silhouette coefficient, iRaPCA yields on average 14.2 clusters and less than 1% of outliers, whereas the other methods yield on average 221.9 (Ward), 175.8 (Complete) and 127.6 (Butina) clusters and more than 10% of outliers (23% in the case of Butina).

Conclusions: iRaPCA has shown great potential for generating dense and well-separated clusters of molecules, as demonstrated in the benchmarking exercise. The implementation of our method as a web app allows users who are unfamiliar with programming to perform a quick and easy clustering of molecules using their own parameters.
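
The combination of feature bagging, PCA and k-means described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: it assumes NumPy and scikit-learn, uses synthetic data from `make_blobs` in place of molecular descriptors computed from SMILES, and every function name, parameter and default (`subspace_pca_kmeans`, `iterative_clustering`, `subset_size`, `max_size`, and so on) is hypothetical. In particular, the rule of re-clustering any cluster larger than `max_size` is only one plausible reading of "iterative" and is an assumption.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler


def subspace_pca_kmeans(X, n_subsets=10, subset_size=10, n_components=2,
                        k_values=(2, 3, 4, 5), seed=0):
    """One round: return the best (silhouette, labels) over random feature subsets."""
    rng = np.random.default_rng(seed)
    size = min(subset_size, X.shape[1])
    best_score, best_labels = -1.0, None
    for _ in range(n_subsets):
        # Random subspace (feature bagging): sample descriptors without replacement.
        cols = rng.choice(X.shape[1], size=size, replace=False)
        scaled = StandardScaler().fit_transform(X[:, cols])
        # PCA reduces the sampled subspace to a few components.
        reduced = PCA(n_components=n_components).fit_transform(scaled)
        for k in k_values:
            labels = KMeans(n_clusters=k, n_init=10,
                            random_state=seed).fit_predict(reduced)
            # Internal validation: silhouette computed in the reduced space.
            score = silhouette_score(reduced, labels)
            if score > best_score:
                best_score, best_labels = score, labels
    return best_score, best_labels


def iterative_clustering(X, max_size=60, max_rounds=3, **kwargs):
    """Assumed iteration rule: re-cluster every cluster larger than max_size."""
    labels = np.zeros(len(X), dtype=int)
    for _ in range(max_rounds):
        next_label = labels.max() + 1
        changed = False
        for c in np.unique(labels):
            idx = np.flatnonzero(labels == c)
            if len(idx) <= max_size:
                continue
            _, sub = subspace_pca_kmeans(X[idx], **kwargs)
            # Sub-cluster 0 keeps the parent label; the others get new labels.
            for s in np.unique(sub):
                if s != 0:
                    labels[idx[sub == s]] = next_label
                    next_label += 1
            changed = True
        if not changed:
            break
    return labels


# Synthetic stand-in for a descriptor matrix (150 "molecules", 20 descriptors).
X, _ = make_blobs(n_samples=150, n_features=20, centers=3, random_state=0)
labels = iterative_clustering(X, max_size=60, n_subsets=5)
print("clusters found:", len(np.unique(labels)))
```

In this sketch the silhouette coefficient is computed in the PCA-reduced subspace, so scores from different feature subsets are only loosely comparable. A real workflow would replace `make_blobs` with descriptors calculated from the uploaded SMILES.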