Title :
Feature Selection with Imbalanced Data for Software Defect Prediction
Author :
Khoshgoftaar, Taghi M. ; Gao, Kehan
Author_Institution :
Florida Atlantic Univ., Boca Raton, FL, USA
Abstract :
In this paper, we study how data sampling followed by attribute selection affects classification models built on binary, class-imbalanced data in the context of software quality engineering. We use a wrapper-based attribute ranking technique to select a subset of attributes, and random undersampling (RUS) of the majority class to alleviate the negative effects of imbalanced data on the prediction models. The datasets used in the empirical study were collected from numerous software projects. Five data preprocessing scenarios were explored: (1) training on the original, unaltered fit dataset, (2) training on a sampled version of the fit dataset, (3) training on an unsampled version of the fit dataset using only the attributes chosen by feature selection based on the unsampled fit dataset, (4) training on an unsampled version of the fit dataset using only the attributes chosen by feature selection based on a sampled version of the fit dataset, and (5) training on a sampled version of the fit dataset using only the attributes chosen by feature selection based on the sampled version of the fit dataset. We compared the performance of the classification models constructed under these five scenarios. The results demonstrate that the models built on sampled fit data, with or without feature selection (cases 2 and 5), significantly outperformed the models built on unsampled fit data (cases 1, 3, and 4). Moreover, the two scenarios using sampled data (cases 2 and 5) showed very similar performance, even though the attribute subset used in case 5 contains only around 15% or 30% of the complete set of attributes used in case 2.
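The five preprocessing scenarios can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' implementation: the function names (`random_undersample`, `project`, `five_scenarios`, `mean_gap_selector`), the 50:50 post-sampling class ratio, and the toy data are all assumptions, since the abstract does not specify them. In particular, `mean_gap_selector` is a simple filter-style stand-in and not the wrapper-based attribute ranking used in the paper; any callable that maps `(rows, labels)` to a list of column indices could be passed instead.

```python
import random
from collections import Counter


def random_undersample(rows, labels, minority_share=0.5, seed=0):
    """Randomly drop majority-class rows until the minority class makes up
    about `minority_share` of the data (assumed ratio; binary labels only)."""
    counts = Counter(labels)
    if len(counts) != 2:
        raise ValueError("expected exactly two classes")
    (majority, _), (minority, n_min) = counts.most_common(2)
    maj_idx = [i for i, y in enumerate(labels) if y == majority]
    # Majority rows to keep so that n_min / (n_min + keep) ~= minority_share.
    keep = min(round(n_min * (1 - minority_share) / minority_share), len(maj_idx))
    kept = set(random.Random(seed).sample(maj_idx, keep))
    idx = [i for i, y in enumerate(labels) if y == minority or i in kept]
    return [rows[i] for i in idx], [labels[i] for i in idx]


def project(rows, cols):
    """Keep only the selected attribute columns."""
    return [[r[c] for c in cols] for r in rows]


def five_scenarios(rows, labels, select_features, seed=0):
    """Build the five training sets described in the abstract.
    `select_features(rows, labels)` is a hypothetical callable that
    returns the indices of the chosen attributes."""
    s_rows, s_labels = random_undersample(rows, labels, seed=seed)
    cols_unsampled = select_features(rows, labels)
    cols_sampled = select_features(s_rows, s_labels)
    return {
        1: (rows, labels),                             # original fit data
        2: (s_rows, s_labels),                         # sampled fit data
        3: (project(rows, cols_unsampled), labels),    # select on unsampled, train on unsampled
        4: (project(rows, cols_sampled), labels),      # select on sampled, train on unsampled
        5: (project(s_rows, cols_sampled), s_labels),  # select on sampled, train on sampled
    }


def mean_gap_selector(k):
    """Stand-in selector (NOT the paper's wrapper-based ranker): keep the k
    attributes whose class means differ the most."""
    def select(rows, labels):
        classes = sorted(set(labels))

        def gap(col):
            means = []
            for cls in classes:
                vals = [r[col] for r, y in zip(rows, labels) if y == cls]
                means.append(sum(vals) / len(vals))
            return abs(means[0] - means[1])

        ranked = sorted(range(len(rows[0])), key=gap, reverse=True)
        return sorted(ranked[:k])
    return select


# Toy fit data: 20 modules, 4 attributes, 4 defective (label 1).
toy_rows = [[float(i), float(i % 3), 10.0 if i < 4 else 0.0, 1.0] for i in range(20)]
toy_labels = [1 if i < 4 else 0 for i in range(20)]
scenarios = five_scenarios(toy_rows, toy_labels, mean_gap_selector(k=2))
for case, (x, y) in scenarios.items():
    print(case, len(x), "rows,", len(x[0]), "attributes,", dict(Counter(y)))
```

Each entry in `scenarios` would then be handed to the same learner, so that differences in model performance can be attributed to the preprocessing alone.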
Keywords :
fault diagnosis; software fault tolerance; software quality; attribute selection; binary class imbalanced data; classification model; data sampling; feature selection; learning impact; random undersampling technique; software defect prediction; software quality engineering; wrapper-based attribute ranking; Application software; Data engineering; Data mining; Data preprocessing; Machine learning; Predictive models; Project management; Sampling methods; Software measurement; Software quality; imbalanced data
Conference_Title :
Machine Learning and Applications, 2009. ICMLA '09. International Conference on
Conference_Location :
Miami Beach, FL
Print_ISBN :
978-0-7695-3926-3
DOI :
10.1109/ICMLA.2009.18