Comparisons of classification methods for screening potential compounds

Author

An, Aijun ; Wang, Yuanyuan

Author_Institution

Dept. of Comput. Sci., York Univ., Toronto, Ont., Canada

fYear

2001

fDate

2001

Firstpage

11

Lastpage

18

Abstract

We compare a number of data mining and statistical methods on the drug design problem of modeling molecular structure-activity relationships. These relationships can be used to identify active compounds in a large inventory of chemical compounds based on their chemical structures. The data set for this application has a highly skewed class distribution, in which only 2% of the compounds are considered active. We apply a number of classification methods to this extremely imbalanced data set and propose using different performance measures to evaluate them. We report our findings on the characteristics of the performance measures, the effect of pruning techniques in this application, and a comparison of local learning methods with global techniques. We also investigate whether reducing the imbalance in the training data by up-sampling or down-sampling improves predictive performance.

Keywords

chemistry computing; data mining; learning (artificial intelligence); pattern classification; pharmaceutical industry; active compounds; chemical compounds; chemical structures; classification methods; data mining; data set; down-sampling; drug design problem; global techniques; highly skewed class distribution; imbalanced data set; local learning methods; molecular structure-activity relationships; performance measures; potential compound screening; predictive performance; pruning techniques; statistical methods; training data; up-sampling; Chemical compounds; Computer science; Data mining; Drugs; High temperature superconductors; Human immunodeficiency virus; Protection; Statistics; Testing; Throughput

fLanguage

English

Publisher

IEEE

Conference_Title

Data Mining, 2001. ICDM 2001, Proceedings IEEE International Conference on

Conference_Location

San Jose, CA

Print_ISBN

0-7695-1119-8

Type

conf

DOI

10.1109/ICDM.2001.989495

Filename

989495