Title :
Contrasting Undersampled Boosting with Internal and External Feature Selection for Patient Response Datasets
Author :
Khoshgoftaar, Taghi M. ; Dittman, David J. ; Wald, Randall ; Napolitano, Antonio
Author_Institution :
Florida Atlantic Univ., Boca Raton, FL, USA
Abstract :
Class imbalance (one class has many more instances than the others) and high dimensionality (a large number of features per instance) are two problems frequently present in patient response datasets, which are already notoriously difficult to model effectively. This paper introduces a new hybrid boosting algorithm, SelectRUSBoost, which combines data sampling and feature selection within every iteration of boosting. We compare SelectRUSBoost against RUSBoost combined with external feature selection on five patient response datasets, using two classifiers, three filter-based feature selection techniques, and four feature subset sizes. Our results show that SelectRUSBoost, with few exceptions, outperforms RUSBoost combined with external feature selection. The information gain feature selection technique outperformed the other techniques for all combinations of boosting approach, classifier, and feature subset size; with this technique, SelectRUSBoost outperformed RUSBoost combined with external selection without exception. Statistical analysis confirmed that SelectRUSBoost gives better performance than RUSBoost combined with external selection. This is the first work to use SelectRUSBoost in a bioinformatics study.
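The abstract describes the algorithm only at a high level, so the following is a minimal sketch of the general idea rather than the authors' implementation. It assumes that each boosting round first undersamples the majority class and then ranks features on that sample; that ordering is an assumption. The filter used here (absolute difference of class means) is a simple stand-in for information gain, the weak learner is a decision stump rather than either of the paper's classifiers, and the weight update is a simplified AdaBoost.M1-style rule. All function names (`undersample`, `rank_features`, `select_rus_boost`, and so on) and the toy data are hypothetical.

```python
import math
import random

def undersample(y, rng):
    """Keep every minority instance plus an equal-sized random majority subset."""
    pos = [i for i, label in enumerate(y) if label == 1]
    neg = [i for i, label in enumerate(y) if label == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    return minority + rng.sample(majority, len(minority))

def rank_features(X, y, idx, k):
    """Stand-in filter (NOT information gain): absolute difference of class means."""
    def score(j):
        ones = [X[i][j] for i in idx if y[i] == 1]
        zeros = [X[i][j] for i in idx if y[i] == 0]
        return abs(sum(ones) / len(ones) - sum(zeros) / len(zeros))
    return sorted(range(len(X[0])), key=score, reverse=True)[:k]

def predict_stump(stump, x):
    """Stump (j, t, flip): predict 1 when x[j] > t, inverted if flip is True."""
    j, t, flip = stump
    pred = 1 if x[j] > t else 0
    return 1 - pred if flip else pred

def fit_stump(X, y, w, idx, features):
    """Weighted decision stump using only the selected features and sampled rows."""
    best_err, best = None, None
    for j in features:
        for t in sorted({X[i][j] for i in idx}):
            for flip in (False, True):
                cand = (j, t, flip)
                err = sum(w[i] for i in idx if predict_stump(cand, X[i]) != y[i])
                if best_err is None or err < best_err:
                    best_err, best = err, cand
    return best

def select_rus_boost(X, y, rounds=5, k=2, seed=0):
    """Boosting loop with sampling and feature selection inside every round."""
    rng = random.Random(seed)
    n = len(X)
    w = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        idx = undersample(y, rng)               # 1. data sampling
        feats = rank_features(X, y, idx, k)     # 2. feature selection on the sample
        stump = fit_stump(X, y, w, idx, feats)  # 3. weak learner on the reduced data
        wrong = [predict_stump(stump, X[i]) != y[i] for i in range(n)]
        err = sum(w[i] for i in range(n) if wrong[i])
        if err >= 0.5:                          # discard learners no better than chance
            continue
        err = max(err, 1e-10)                   # avoid division by zero below
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, stump))
        # Simplified AdaBoost.M1-style update: raise weights of misclassified rows.
        w = [w[i] * math.exp(alpha if wrong[i] else -alpha) for i in range(n)]
        total = sum(w)
        w = [v / total for v in w]
    return ensemble

def predict(ensemble, x):
    """Weighted vote of the stumps; class 1 if the signed score is positive."""
    score = sum(a if predict_stump(s, x) == 1 else -a for a, s in ensemble)
    return 1 if score > 0 else 0

# Hypothetical toy data: 20 majority rows (class 0), 5 minority rows (class 1);
# only feature index 2 separates the classes.
X = [[(i % 3) * 0.1, (i % 5) * 0.1, 0.0, (i % 2) * 0.1] for i in range(20)]
X += [[(i % 3) * 0.1, (i % 5) * 0.1, 1.0, (i % 2) * 0.1] for i in range(5)]
y = [0] * 20 + [1] * 5
model = select_rus_boost(X, y)
```

In the external-selection setup that the abstract contrasts against, the ranking would presumably be computed once on the training data before boosting starts, so `rank_features` would move out of the loop while the undersampling stays inside it.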
Keywords :
bioinformatics; data mining; feature selection; learning (artificial intelligence); patient treatment; pattern classification; sampling methods; RUSBoost; SelectRUSBoost; class imbalance; data sampling; feature classifier; feature subset size; filter-based feature selection technique; hybrid boosting algorithm; iteration method; patient response datasets; undersampled boosting; Boosting; Buildings; DNA; Data models; Logistics; Stability analysis; High Dimensionality; Patient Response
Conference_Title :
Machine Learning and Applications (ICMLA), 2013 12th International Conference on
Conference_Location :
Miami, FL
DOI :
10.1109/ICMLA.2013.156