RESEARCHERS
SOTO Axel Juan
Conferences and scientific meetings
Title:
On Designing Confident Statistical QSPR Models
Author(s):
AXEL JUAN SOTO; IGNACIO PONZONI; GUSTAVO ESTEBAN VAZQUEZ
Location:
Los Cocos, Provincia de Córdoba, Argentina
Meeting:
Conference; 1° Reunión Internacional de Ciencias Farmacéuticas; 2010
Organizing institution:
Facultad de Ciencias Bioquímicas y Farmacéuticas, Universidad Nacional de Rosario
Abstract:
Modern drug discovery is characterized by activity and property prediction techniques that allow virtual screening of leads. In this regard, QSPR (Quantitative Structure-Property Relationship) methods aim at modeling a biological or physicochemical property of a compound from its molecular structure. Although many papers on this subject have been published during the last decade, the prediction capacity of QSAR models still remains to be improved [1]. In this paper we present a number of caveats for avoiding common errors in statistical models for QSPR methods, and we propose alternatives to address these issues.

First of all, any prediction method must have a clear validation procedure. The best validation procedure uses an additional set of data that was separated from the training data from the very beginning, before the model is constructed. This testing data should be used only once; otherwise the reported performance may be misleading. When the number of available compounds is not large enough (the more typical situation), a cross-validation procedure should be carried out in order to obtain an unbiased estimate of the prediction capacity. K-fold cross-validation techniques are far better than the widespread leave-one-out (LOO) procedure, which is overoptimistic compared to the testing prediction capacity.

Many new statistical and machine learning methods are commonly applied for inferring structure-activity relationships. These methods can be a double-edged sword: they can fit any non-linear relationship, but that same flexibility makes it easy to fit noise or find chance correlations [2].

Another crucial decision is the selection of the subset of descriptors that are relevant to the activity or property under study. This selection determines the prediction capacity and the degree of interpretability of the model.

Materials and methods
We propose three different models based on a neural network method. Model 1 has relevant descriptors and appropriate training. Model 2 was overfitted. Model 3 was extended with 10 additional random descriptors. Prediction capacities are evaluated using the training set, a testing set (30% of the initial data) and LOO. We used a database of 439 compounds with experimental logP values.

Results
The table shows the prediction capacity obtained with the abovementioned strategies and evaluation methods. MSE (Mean Square Error) and r² (coefficient of determination) are used as goodness-of-fit metrics.

Conclusions
Rigorous design and comprehensive validation of QSPR models are essential for obtaining reliable methods. Training or LOO evaluations are not representative enough for reporting true prediction capacity. Even though these caveats are not sufficient for a good QSPR method, we believe that they constitute necessary design decisions.

Acknowledgments
This work is supported by grants CONICET PIP11220090100322 and UNS PGI 24/ZN15 & 24/ZN16.
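
The evaluation protocols named in the abstract (training-set error, a 30% held-out testing set, k-fold cross-validation and LOO) can be sketched as follows. This is only an illustrative sketch, not the authors' implementation: it assumes synthetic one-descriptor data in place of the 439-compound logP database, and a least-squares line in place of the neural network. All function names (`fit_line`, `mse`, `r2`, `k_fold_indices`, `cross_val_mse`) are hypothetical helpers introduced here.

```python
import random


def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept (stand-in for the neural network)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mse(model, xs, ys):
    """Mean square error of the model's predictions on (xs, ys)."""
    slope, intercept = model
    return sum((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def r2(model, xs, ys):
    """Coefficient of determination of the model's predictions on (xs, ys)."""
    slope, intercept = model
    mean_y = sum(ys) / len(ys)
    ss_res = sum((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


def k_fold_indices(n, k, rng):
    """Shuffle the indices 0..n-1 and split them into k disjoint folds."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_val_mse(xs, ys, k, rng):
    """k-fold cross-validated MSE; k == len(xs) gives leave-one-out (LOO)."""
    total_sq_err = 0.0
    for fold in k_fold_indices(len(xs), k, rng):
        held_out = set(fold)
        train_x = [x for i, x in enumerate(xs) if i not in held_out]
        train_y = [y for i, y in enumerate(ys) if i not in held_out]
        slope, intercept = fit_line(train_x, train_y)
        total_sq_err += sum((slope * xs[i] + intercept - ys[i]) ** 2 for i in fold)
    return total_sq_err / len(xs)


if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic data (assumption): one descriptor, linear property plus noise.
    xs = [rng.uniform(0.0, 10.0) for _ in range(100)]
    ys = [2.0 * x + 1.0 + rng.gauss(0.0, 1.0) for x in xs]

    # Separate 30% as a testing set before any model is built.
    n_test = int(0.3 * len(xs))
    test_x, test_y = xs[:n_test], ys[:n_test]
    train_x, train_y = xs[n_test:], ys[n_test:]

    model = fit_line(train_x, train_y)
    print("training MSE:", mse(model, train_x, train_y), "r2:", r2(model, train_x, train_y))
    print("testing MSE: ", mse(model, test_x, test_y), "r2:", r2(model, test_x, test_y))
    print("10-fold MSE: ", cross_val_mse(train_x, train_y, 10, rng))
    print("LOO MSE:     ", cross_val_mse(train_x, train_y, len(train_x), rng))
```

The testing set is split off first and scored only once, while both cross-validation estimates are computed on the training portion alone. With a model this simple the four numbers will usually be close; the gap the abstract warns about is expected to show up with flexible models and noisy or irrelevant descriptors, which this sketch does not attempt to reproduce.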