DocumentCode
1831607
Title
Gene selection stability's dependence on dataset difficulty
Author
Dittman, David J. ; Khoshgoftaar, Taghi ; Wald, Randall ; Napolitano, Antonio
Author_Institution
Florida Atlantic Univ., Boca Raton, FL, USA
fYear
2013
fDate
14-16 Aug. 2013
Firstpage
341
Lastpage
348
Abstract
Identifying important biomarkers to improve disease diagnosis and treatment is a significant topic of research in bioinformatics. However, bioinformatics datasets frequently have a large number of features per sample or instance. This problem, known as “high dimensionality,” can be alleviated through the use of dimension reducing techniques such as feature (gene) selection which remove unnecessary features. There are many versions of feature selection, with varying biases and predictive abilities. However, predictive power is but one factor to consider when choosing a feature selection technique: one must also consider the technique's stability, that is, its ability to create feature subsets which remain valid in the face of changes to the data. While there has been work in determining the relative stability of different feature selection techniques, this does not always help determine whether a chosen feature selection technique will give stable feature subsets for a specific dataset. Factors such as difficulty of learning (e.g., dataset difficulty) may also influence feature selection stability, making generally-true facts about different techniques not applicable to a given dataset. In this work, we study how dataset difficulty can affect the stability of feature selection techniques, leading to good performance from bad techniques and vice versa. We use a set of twenty-six DNA microarray datasets with varying levels of difficulty of learning, along with four levels of dataset perturbation, six feature selection techniques with various levels of stability, and twelve feature subset sizes. The results show that as the dataset difficulty increases, the stability decreases. However, the relative stability between the techniques remains the same. Additionally, the more difficult the dataset, the more the stability is affected by changes to the data. We also found that unstable rankers are more affected by the transition between Easy and Moderate datasets, whereas the stable techniques are more affected by the change between Moderate and Hard datasets. Lastly, as the feature subset size increases, the stability increases and the difference between the levels of dataset difficulty decreases. Overall, we conclude that difficulty of learning must be taken into account before interpreting stability results.
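The notion of stability described in the abstract (feature subsets that remain valid when the data changes) can be illustrated with a minimal sketch. This record does not say which stability metric, rankers, or perturbation scheme the paper uses, so everything here is an assumption for illustration only: the Jaccard index as the similarity measure, a simple difference-of-class-means ranker (`top_k_features`), stratified subsampling as the perturbation, and the parameter names `keep_fraction` and `runs`. Both classes (labels 0 and 1) are assumed to be present in the data.

```python
import random


def jaccard(a, b):
    """Jaccard similarity between two feature subsets (assumed metric)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def top_k_features(X, y, k):
    """Hypothetical ranker: score each feature by the absolute difference
    between its mean in class 1 and its mean in class 0, keep the top k."""
    n_features = len(X[0])
    scores = []
    for j in range(n_features):
        pos = [row[j] for row, label in zip(X, y) if label == 1]
        neg = [row[j] for row, label in zip(X, y) if label == 0]
        scores.append(abs(sum(pos) / len(pos) - sum(neg) / len(neg)))
    # Ties are broken by feature index because sorted() is stable.
    return sorted(range(n_features), key=lambda j: -scores[j])[:k]


def stability(X, y, k, keep_fraction=0.9, runs=20, seed=0):
    """Average Jaccard similarity between the subset chosen on the full data
    and the subsets chosen on stratified subsamples (perturbed datasets)."""
    rng = random.Random(seed)
    reference = top_k_features(X, y, k)
    pos_idx = [i for i, label in enumerate(y) if label == 1]
    neg_idx = [i for i, label in enumerate(y) if label == 0]

    def subsample(indices):
        # Keep at least one instance per class so class means stay defined.
        return rng.sample(indices, max(1, int(len(indices) * keep_fraction)))

    total = 0.0
    for _ in range(runs):
        idx = subsample(pos_idx) + subsample(neg_idx)
        X_sub = [X[i] for i in idx]
        y_sub = [y[i] for i in idx]
        total += jaccard(reference, top_k_features(X_sub, y_sub, k))
    return total / runs
```

Lowering `keep_fraction` corresponds to a stronger perturbation, and varying `k` corresponds to changing the feature subset size; a score near 1 means the selected subset barely changes under perturbation.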
Keywords
bioinformatics; data handling; genetics; molecular biophysics; biomarker identification; dataset difficulty; dimension reduction techniques; gene feature selection stability; Bioinformatics; Computational efficiency; DNA; Indexes; Measurement; Stability criteria; DNA Microarray; Difficulty of Learning; Feature Selection; Stability
fLanguage
English
Publisher
IEEE
Conference_Titel
Information Reuse and Integration (IRI), 2013 IEEE 14th International Conference on
Conference_Location
San Francisco, CA
Type
conf
DOI
10.1109/IRI.2013.6642491
Filename
6642491