• DocumentCode
    1490618
  • Title
    Combating the Small Sample Class Imbalance Problem Using Feature Selection
  • Author
    Wasikowski, Mike; Chen, Xue-wen
  • Author_Institution
    US Army Training & Doctrine Command Analysis Center, Fort Leavenworth, KS, USA
  • Volume
    22
  • Issue
    10
  • fYear
    2010
  • Firstpage
    1388
  • Lastpage
    1400
  • Abstract
    The class imbalance problem is encountered in real-world applications of machine learning and results in a classifier's suboptimal performance. Researchers have rigorously studied the resampling, algorithms, and feature selection approaches to this problem. No systematic studies have been conducted to understand how well these methods combat the class imbalance problem and which of these methods best manage the different challenges posed by imbalanced data sets. In particular, feature selection has rarely been studied outside of text classification problems. Additionally, no studies have looked at the additional problem of learning from small samples. This paper presents a first systematic comparison of the three types of methods developed for imbalanced data classification problems and of seven feature selection metrics evaluated on small sample data sets from different applications. We evaluated the performance of these metrics using area under the receiver operating characteristic (AUC) and area under the precision-recall curve (PRC). We compared each metric on the average performance across all problems and on the likelihood of a metric yielding the best performance on a specific problem. We examined the performance of these metrics inside each problem domain. Finally, we evaluated the efficacy of these metrics to see which perform best across algorithms. Our results showed that signal-to-noise correlation coefficient (S2N) and Feature Assessment by Sliding Thresholds (FAST) are great candidates for feature selection in most applications, especially when selecting very small numbers of features.
  • Keywords
    feature extraction; learning (artificial intelligence); pattern classification; sampling methods; sensitivity analysis; text analysis; classifier suboptimal performance; feature assessment by sliding thresholds; feature selection metrics; imbalanced data classification; machine learning; metric performance evaluation; precision recall curve; receiver operating characteristic; signal to noise correlation coefficient; small sample class imbalance problem; text classification; Accuracy; Machine learning; Machine learning algorithms; Partial response channels; Pattern recognition; Performance evaluation; Support vector machine classification; Support vector machines; Text categorization; Training data; Class imbalance problem; bioinformatics; feature evaluation and selection; machine learning; pattern recognition; text mining
  • fLanguage
    English
  • Journal_Title
    Knowledge and Data Engineering, IEEE Transactions on
  • Publisher
    IEEE
  • ISSN
    1041-4347
  • Type
    jour
  • DOI
    10.1109/TKDE.2009.187
  • Filename
    5276797