RESEARCHERS
SOTO Axel Juan
Conferences and scientific meetings
Title:
Identifying the applicability domain of QSPR models using machine learning
Author(s):
AXEL JUAN SOTO; IGNACIO PONZONI; GUSTAVO ESTEBAN VAZQUEZ
Place:
Termas de Chillan
Meeting:
Conference; I Iberoamerican Conference on Bioinformatics; 2010
Organizing institution:
Sociedad Iberoamericana de Bioinformática (SoiBio)
Abstract:
QSPR (Quantitative Structure-Property Relationship) modeling comprises methods for predicting physicochemical properties of compounds from their molecular structure (descriptors). These in silico methods identify promising compounds for experimental analysis, helping to discard unsuitable compounds before they are synthesized. The design of accurate QSPR models is therefore a crucial issue in drug research, yielding important savings in time and cost. Several works [1,2] describe the main issues in achieving reliable models. A major problem is the lack of generalization capability when non-homogeneous data are used. In other words, an unseen compound could lie outside the model's applicability domain (AD), and hence its prediction is likely to be unreliable. A methodology to determine the AD of QSPR models generated using machine learning methods was proposed by Soto et al. [3]. This procedure combines supervised and unsupervised learning methods. The goal of the unsupervised method is to detect regions within the training chemical space in which the behavior of the supervised method becomes unpredictable. A main contribution is that, while other proposals only check similarity to the training set as a whole (extrapolation problems), here the AD is also checked within the training set (interpolation problems). This new technique uses self-organizing maps and the Hotelling statistic to determine whether the prediction for a compound in the test set is confident or unconfident, where confident refers to the probability of having a low prediction error when a specific prediction method is used. Four data sets covering different properties were used for testing: blood-brain barrier permeation (DS1), intestinal absorption (DS2) and octanol-water partition coefficient (DS3 and DS4). The tables show the prediction errors in terms of mean average error for the training set (T), the testing set (H) and the subsets classified as confident (S1) and unconfident (S2).
The last row gives the probability of error when concluding that the mean of subset S1 or S2 differs from the test-set mean.
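The abstract does not name the statistical test behind that last row. As a minimal sketch only, and not the authors' procedure, one way to estimate such a probability is a Monte Carlo comparison against random subsets of the test set: if S1 (or S2) were just a random draw from H, how often would its mean error deviate from the test mean at least as much as observed? The function name `subset_mean_p_value`, its parameters, and the error values below are all hypothetical and chosen for illustration.

```python
import random
from statistics import mean


def subset_mean_p_value(test_errors, subset_idx, n_draws=10000, seed=0):
    """Estimate the probability of wrongly concluding that a subset's mean
    error differs from the mean error of the whole test set.

    Assumed null hypothesis (not stated in the abstract): the subset is a
    random draw, without replacement, from the test set.
    """
    rng = random.Random(seed)
    overall = mean(test_errors)
    k = len(subset_idx)
    subset_errors = [test_errors[i] for i in subset_idx]
    # Observed absolute deviation of the subset mean from the test mean.
    observed = abs(mean(subset_errors) - overall)

    hits = 0
    for _ in range(n_draws):
        draw = rng.sample(test_errors, k)  # random subset of the same size
        if abs(mean(draw) - overall) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_draws + 1)


# Hypothetical absolute prediction errors for a test set H (made-up values).
errors_H = [0.10, 0.20, 0.15, 0.30, 0.25, 1.20, 1.50, 0.90, 1.10, 0.20, 0.10, 1.30]
confident_S1 = [0, 1, 2, 3, 4, 9, 10]   # indices labeled confident (assumed)
unconfident_S2 = [5, 6, 7, 8, 11]       # indices labeled unconfident (assumed)

p_S1 = subset_mean_p_value(errors_H, confident_S1)
p_S2 = subset_mean_p_value(errors_H, unconfident_S2)
print(f"S1 vs H: p = {p_S1:.4f}")
print(f"S2 vs H: p = {p_S2:.4f}")
```

A small value would suggest that the confident/unconfident split separates low-error from high-error predictions better than a random split of the test set would. A parametric test on the means could be substituted if its assumptions hold for the data at hand.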