Title of article :
Reducing over-optimism in variable selection by cross-model validation
Author/Authors :
Anderssen, Endre and Dyrstad, Knut and Westad, Frank and Martens, Harald
Issue Information :
Biannual, with serial issue number, year 2006
Abstract :
Extensive optimisation of a mathematical model's fit to a relatively small set of empirical data may lead to over-optimistic validation results. If the assessment of the final, optimised model is based on the same validation method and the same input data that were used as the basis for the extensive model optimisation, accumulated spurious correlations may appear as real predictive ability in the final model validation. An example of this is the use of extensive variable selection in multiple regression, based on a cross-model validation scheme.
To illustrate the over-optimism problem in optimisation based on conventional one-layered validation, an artificial data set containing only random numbers was submitted to regression modelling. The model was optimised by stepwise variable selection. A very good apparent predictive ability for y from X was found in the final model by leave-one-out cross-validation (84%), after the number of X-variables had been reduced stepwise from 500 to 29. Finally, the performance of cross-model validation is tested on one large QSAR data set. Several calibration sets were chosen randomly and a regression model was optimised by variable selection. The prediction accuracy of these models was compared to the cross-validation and cross-model validation results. In these tests, cross-model validation gives the better measure of model predictive ability.
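The random-number experiment described in the abstract can be imitated in a few lines. The following is only a minimal sketch under stated assumptions, not the authors' procedure: it replaces stepwise selection with a simple ranking of X-variables by absolute correlation with y, uses ordinary least squares, and picks hypothetical sizes (30 samples, 500 random variables, 5 retained). The names `select_top_k`, `fit_predict` and `loo_q2` are illustrative. It contrasts one-layered validation (variables selected once on all data, then leave-one-out cross-validation) with a two-layered scheme in the spirit of cross-model validation (selection repeated inside every leave-one-out segment).

```python
import numpy as np

def select_top_k(X, y, k):
    """Indices of the k columns of X with the largest absolute correlation with y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    corr = (Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
    return np.argsort(-np.abs(corr))[:k]

def fit_predict(X_train, y_train, X_test):
    """Ordinary least squares with an intercept; returns predictions for X_test."""
    A = np.column_stack([np.ones(len(X_train)), X_train])
    coef, *_ = np.linalg.lstsq(A, y_train, rcond=None)
    return np.column_stack([np.ones(len(X_test)), X_test]) @ coef

def loo_q2(X, y, k, select_inside):
    """Leave-one-out explained variance, 1 - PRESS / SS_total.

    select_inside=False: variables are chosen once from all samples (one-layered).
    select_inside=True: variables are re-chosen from each training segment, so the
    held-out sample never influences the selection (two-layered).
    """
    n = len(y)
    cols_all = select_top_k(X, y, k)
    press = 0.0
    for i in range(n):
        train = np.arange(n) != i
        cols = select_top_k(X[train], y[train], k) if select_inside else cols_all
        pred = fit_predict(X[train][:, cols], y[train], X[i:i + 1, cols])
        press += (y[i] - pred[0]) ** 2
    return 1.0 - press / np.sum((y - y.mean()) ** 2)

# Assumed toy setting: pure noise, so no real relation between X and y exists.
rng = np.random.default_rng(0)
X = rng.standard_normal((30, 500))
y = rng.standard_normal(30)

q2_one_layer = loo_q2(X, y, k=5, select_inside=False)
q2_two_layer = loo_q2(X, y, k=5, select_inside=True)
print(f"one-layered Q2: {q2_one_layer:.2f}")
print(f"two-layered Q2: {q2_two_layer:.2f}")
```

Because the data are pure noise, any apparent predictive ability in the one-layered figure is spurious; the two-layered figure is the one expected to reflect the absence of a real relation.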
Keywords :
Jack-knifing , QSAR , Regression , Over-fitting , Cross-model validation , Variable selection
Journal title :
Chemometrics and Intelligent Laboratory Systems