• DocumentCode
    561177
  • Title
    Impact of Noise and Data Sampling on Stability of Feature Selection
  • Author
    Shanab, Ahmad Abu ; Khoshgoftaar, Taghi M. ; Wald, Randall
  • Author_Institution
    Florida Atlantic Univ., Boca Raton, FL, USA
  • Volume
    1
  • fYear
    2011
  • fDate
    18-21 Dec. 2011
  • Firstpage
    172
  • Lastpage
    177
  • Abstract
    High dimensionality is one of the major problems in data mining, occurring when there is a large abundance of attributes. One common technique used to alleviate high dimensionality is feature selection, the process of selecting the most relevant attributes and removing irrelevant and redundant ones. Much research has been done towards evaluating the performance of classifiers before and after feature selection, but little work has been done examining how sensitive the selected feature subsets are to changes (additions/deletions) in the dataset. In this study we evaluate the robustness of six commonly used feature selection techniques, investigating the impact of data sampling and class noise on the stability of feature selection. All experiments are carried out with six commonly used feature rankers on four groups of datasets from the biology domain. We employ three sampling techniques, and generate artificial class noise to better simulate real-world datasets. The results demonstrate that although no ranker consistently outperforms the others, Gain Ratio shows the least stability on average. Additional tests using our feature rankers for building classification models also show that a feature ranker's stability is not an indicator of its performance in classification.
  • Keywords
    data mining; noise; pattern classification; sampling methods; stability; artificial class noise; biology dataset; classification model; data sampling technique; feature ranker; feature selection stability; feature selection technique; gain ratio; high dimensionality problem; Gene expression; Niobium; Noise measurement; Radio frequency; Stability criteria; bioinformatics; class imbalance; classification; feature selection; noise injection
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Machine Learning and Applications and Workshops (ICMLA), 2011 10th International Conference on
  • Conference_Location
    Honolulu, HI
  • Print_ISBN
    978-1-4577-2134-2
  • Type
    conf
  • DOI
    10.1109/ICMLA.2011.74
  • Filename
    6146964