DocumentCode :
1831607
Title :
Gene selection stability's dependence on dataset difficulty
Author :
Dittman, David J. ; Khoshgoftaar, Taghi ; Wald, Randall ; Napolitano, Antonio
Author_Institution :
Florida Atlantic Univ., Boca Raton, FL, USA
fYear :
2013
fDate :
14-16 Aug. 2013
Firstpage :
341
Lastpage :
348
Abstract :
Identifying important biomarkers to improve disease diagnosis and treatment is a significant topic of research in bioinformatics. However, bioinformatics datasets frequently have a large number of features per sample or instance. This problem, known as “high dimensionality,” can be alleviated through the use of dimension-reducing techniques such as feature (gene) selection, which remove unnecessary features. There are many versions of feature selection, with varying biases and predictive abilities. However, predictive power is but one factor to consider when choosing a feature selection technique: one must also consider the technique's stability, that is, its ability to create feature subsets which remain valid in the face of changes to the data. While there has been work in determining the relative stability of different feature selection techniques, this does not always help determine whether a chosen feature selection technique will give stable feature subsets for a specific dataset. Factors such as difficulty of learning (e.g., dataset difficulty) may also influence feature selection stability, making generally-true facts about different techniques not applicable to a given dataset. In this work, we study how dataset difficulty can affect the stability of feature selection techniques, leading to good performance from bad techniques and vice versa. We use a set of twenty-six DNA microarray datasets with varying levels of difficulty of learning, along with four levels of dataset perturbation, six feature selection techniques with various levels of stability, and twelve feature subset sizes. The results show that as the dataset difficulty increases, the stability decreases. However, the relative stability between the techniques remains the same. Additionally, the more difficult the dataset, the more the stability is affected by changes to the data.
We also found that unstable rankers are more affected by the transition between Easy and Moderate datasets, whereas the stable techniques are more affected by the change between Moderate and Hard datasets. Lastly, as the feature subset size increases, the stability increases and the difference between the levels of dataset difficulty decreases. Overall, we conclude that difficulty of learning must be taken into account before interpreting stability results.
Keywords :
bioinformatics; data handling; genetics; molecular biophysics; bioinformatics; biomarker identification; dataset difficulty; dimension reduction techniques; gene feature selection stability; Bioinformatics; Computational efficiency; DNA; Indexes; Measurement; Stability criteria; DNA Microarray; Difficulty of Learning; Feature Selection; Stability;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Information Reuse and Integration (IRI), 2013 IEEE 14th International Conference on
Conference_Location :
San Francisco, CA
Type :
conf
DOI :
10.1109/IRI.2013.6642491
Filename :
6642491