DocumentCode
2468154
Title
Evaluating feature selection strategies for high dimensional, small sample size datasets
Author
Golugula, Abhishek ; Lee, George ; Madabhushi, Anant
Author_Institution
Department of Electrical and Computer Engineering, Rutgers University, Piscataway, New Jersey 08854
fYear
2011
fDate
Aug. 30 2011-Sept. 3 2011
Firstpage
949
Lastpage
952
Abstract
In this work, we analyze and evaluate different strategies for comparing Feature Selection (FS) schemes on High Dimensional (HD) biomedical datasets (e.g. gene and protein expression studies) with a small sample size (SSS). Additionally, we define a new measure, Robustness, specifically for comparing the ability of an FS scheme to be invariant to changes in its training data. While classifier accuracy has been the de facto method for evaluating FS schemes, on account of the curse of dimensionality problem, it might not always be the appropriate measure for HD/SSS datasets. SSS lends the dataset a higher probability of containing data that is not representative of the true distribution of the whole population. However, an ideal FS scheme must be robust enough to produce the same results each time there are changes to the training data. In this study, we employed the robustness performance measure in conjunction with classifier accuracy (measured via the K-Nearest Neighbor and Random Forest classifiers) to quantitatively compare five different FS schemes (T-test, F-test, Kolmogorov-Smirnov Test, Wilks Lambda Test and Wilcoxon Rank Sum Test) on five HD/SSS gene and protein expression datasets corresponding to ovarian cancer, lung cancer, bone lesions, celiac disease, and coronary heart disease. Of the five FS schemes compared, the Wilcoxon Rank Sum Test was found to outperform other FS schemes in terms of classification accuracy and robustness. Our results suggest that both classifier accuracy and robustness should be considered when deciding on the appropriate FS scheme for HD/SSS datasets.
Keywords
Accuracy; Cancer; Diseases; Feature extraction; High definition video; Proteins; Robustness; Algorithms; Animals; Data Mining; Databases, Factual; Gene Expression Profiling; Humans; Neoplasm Proteins; Neoplasms; Pattern Recognition, Automated; Signal Transduction
fLanguage
English
Publisher
IEEE
Conference_Titel
Engineering in Medicine and Biology Society, EMBC, 2011 Annual International Conference of the IEEE
Conference_Location
Boston, MA
ISSN
1557-170X
Print_ISBN
978-1-4244-4121-1
Electronic_ISBN
1557-170X
Type
conf
DOI
10.1109/IEMBS.2011.6090214
Filename
6090214