DocumentCode :
1735317
Title :
Comparative Analysis on the Stability of Feature Selection Techniques Using Three Frameworks on Biological Datasets
Author :
Wald, Randall ; Khoshgoftaar, Taghi ; Shanab, Ahmad Abu ; Napolitano, Antonio
Author_Institution :
Florida Atlantic Univ., Boca Raton, FL, USA
Volume :
1
fYear :
2013
Firstpage :
418
Lastpage :
423
Abstract :
Feature (gene) selection is a common preprocessing technique used to counter the problem of high dimensionality (too many independent features) found in many bioinformatics datasets, addressing this problem by creating a smaller feature subset including only the most important features. Although feature selection techniques are often evaluated based on how they can help improve classification performance, it is also important to find stable feature selection techniques which will give consistent results even in the face of dataset perturbations (such as class noise or sampling used to alleviate the problem of imbalanced data). This is especially important in bioinformatics, where the prime concern may be gene discovery rather than classification. In this study we use three frameworks to evaluate the stability of gene selection techniques: "sampled-clean vs. sampled-clean," "sampled-noisy vs. sampled-noisy," and "sampled-clean vs. sampled-noisy." All frameworks involve pairwise comparisons among the results from the perturbed datasets (due to sampling or class noise injection followed by sampling). They differ in terms of whether they observe how sampling can create variation within the feature subsets (sampled-clean vs. sampled-clean), how noisy datasets (which were then sampled) can create a wide spread of selected features (sampled-noisy vs. sampled-noisy), or how features selected on clean and noisy datasets differ, after both datasets have been sampled (sampled-clean vs. sampled-noisy). Along with these three frameworks, our comparison of seven feature ranking techniques uses four cancer gene datasets, applies three sampling techniques, and generates artificial class noise to better simulate real-world datasets.
The results from the frameworks are generally similar, with Signal-To-Noise and ReliefF showing the best stability and Gain Ratio showing the worst across all three frameworks, although Relief-W is notable for showing moderate to above-average stability when the clean datasets are used, but giving the second worst performance when noise was present.
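The pairwise comparisons behind the three frameworks can be illustrated with a minimal Python sketch. The abstract does not say which similarity measure the authors use, so the Jaccard index here is an assumption chosen only for illustration, and the function names and toy gene subsets are hypothetical rather than taken from the paper.

```python
from itertools import combinations, product


def jaccard(a, b):
    """Jaccard similarity of two feature subsets (assumed measure, not the paper's)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def within_group_stability(subsets):
    """Mean pairwise similarity among subsets from one group of perturbed datasets."""
    pairs = list(combinations(subsets, 2))
    if not pairs:
        raise ValueError("need at least two feature subsets")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


def cross_group_stability(group_a, group_b):
    """Mean similarity over every pair with one subset drawn from each group."""
    pairs = list(product(group_a, group_b))
    if not pairs:
        raise ValueError("both groups must be non-empty")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Hypothetical toy subsets: genes selected on three sampled-clean datasets
# and on three sampled-noisy datasets.
clean = [{"g1", "g2", "g3"}, {"g1", "g2", "g4"}, {"g1", "g2", "g3"}]
noisy = [{"g1", "g5", "g6"}, {"g2", "g5", "g7"}, {"g1", "g2", "g5"}]

frameworks = {
    "sampled-clean vs. sampled-clean": within_group_stability(clean),
    "sampled-noisy vs. sampled-noisy": within_group_stability(noisy),
    "sampled-clean vs. sampled-noisy": cross_group_stability(clean, noisy),
}

for name, score in frameworks.items():
    print(f"{name}: {score:.3f}")
```

Under this sketch, the first two frameworks compare subsets within a single group, while the third pairs every clean-derived subset with every noisy-derived one; a higher score means the selected features overlap more.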
Keywords :
bioinformatics; genetics; sampling methods; Relief-W; artificial class noise; biological datasets; cancer gene dataset; feature ranking technique; feature selection technique; feature subset; gain ratio; gene discovery; sampling technique; Cancer; Gene expression; Lungs; Noise; Noise measurement; Stability criteria; Feature Selection; Imbalanced Data; Noise Injection; Stability
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Machine Learning and Applications (ICMLA), 2013 12th International Conference on
Conference_Location :
Miami, FL
Type :
conf
DOI :
10.1109/ICMLA.2013.85
Filename :
6784655