Title :
Comparisons of classification methods for screening potential compounds
Author :
An, Aijun ; Wang, Yuanyuan
Author_Institution :
Dept. of Comput. Sci., York Univ., Toronto, Ont., Canada
Abstract :
We compare a number of data mining and statistical methods on the drug design problem of modeling molecular structure-activity relationships. These relationships can be used to identify active compounds in a large inventory of chemical compounds based on their chemical structures. The data set for this application has a highly skewed class distribution, in which only 2% of the compounds are considered active. We apply a number of classification methods to this extremely imbalanced data set and propose to use different performance measures to evaluate these methods. We report our findings on the characteristics of the performance measures, the effect of using pruning techniques in this application, and a comparison of local learning methods with global techniques. We also investigate whether reducing the imbalance in the training data by up-sampling or down-sampling would improve the predictive performance.
Keywords :
chemistry computing; data mining; learning (artificial intelligence); pattern classification; pharmaceutical industry; active compounds; chemical compounds; chemical structures; classification methods; data mining; data set; down-sampling; drug design problem; global techniques; highly skewed class distribution; imbalanced data set; local learning methods; molecular structure-activity relationships; performance measures; potential compound screening; predictive performance; pruning techniques; statistical methods; training data; up-sampling; Chemical compounds; Computer science; Data mining; Drugs; High temperature superconductors; Human immunodeficiency virus; Protection; Statistics; Testing; Throughput
Conference_Title :
Data Mining, 2001. ICDM 2001, Proceedings IEEE International Conference on
Conference_Location :
San Jose, CA
Print_ISBN :
0-7695-1119-8
DOI :
10.1109/ICDM.2001.989495