• DocumentCode
    1679124
  • Title
    Attribute Selection and Imbalanced Data: Problems in Software Defect Prediction
  • Author
    Khoshgoftaar, Taghi M.; Gao, Kehan; Seliya, Naeem
  • Author_Institution
    Florida Atlantic Univ., Boca Raton, FL, USA
  • Volume
    1
  • fYear
    2010
  • Firstpage
    137
  • Lastpage
    144
  • Abstract
    The data mining and machine learning community is often faced with two key problems: working with imbalanced data and selecting the best features for machine learning. This paper presents a process involving a feature selection technique for selecting the important attributes and a data sampling technique for addressing class imbalance. The application domain of this study is software engineering, more specifically, software quality prediction using classification models. When using feature selection and data sampling together, different scenarios should be considered. The four possible scenarios are: (1) feature selection based on original data, and modeling (defect prediction) based on original data; (2) feature selection based on original data, and modeling based on sampled data; (3) feature selection based on sampled data, and modeling based on original data; and (4) feature selection based on sampled data, and modeling based on sampled data. The research objective is to compare the software defect prediction performances of models based on the four scenarios. The case study consists of nine software measurement data sets obtained from the PROMISE software project repository. Empirical results suggest that feature selection based on sampled data performs significantly better than feature selection based on original data, and that defect prediction models perform similarly regardless of whether the training data was formed using sampled or original data.
  • Keywords
    data mining; learning (artificial intelligence); program debugging; software engineering; attribute selection; classification models; feature selection technique; imbalanced data; machine learning community; software defect prediction; software measurement data sets; software quality prediction; Analysis of variance; Data models; Predictive models; Software; Software metrics; Training data; data sampling; defect prediction; feature selection; software measurements
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Tools with Artificial Intelligence (ICTAI), 2010 22nd IEEE International Conference on
  • Conference_Location
    Arras
  • ISSN
    1082-3409
  • Print_ISBN
    978-1-4244-8817-9
  • Type
    conf
  • DOI
    10.1109/ICTAI.2010.27
  • Filename
    5670030
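
The four scenarios listed in the abstract can be sketched as the cross product of two choices: which data the features are selected on, and which data the model is trained on. The sketch below is illustrative only. The abstract does not name the sampling technique or the feature ranker, so random undersampling and a mean-difference ranking are assumptions, and the function names `undersample`, `select_features`, and `build_scenarios` are hypothetical.

```python
import random
from itertools import product


def undersample(X, y, seed=0):
    """Random undersampling (an assumed technique): keep every minority-class
    row plus an equally sized random subset of majority-class rows."""
    rng = random.Random(seed)
    pos = [i for i, label in enumerate(y) if label == 1]
    neg = [i for i, label in enumerate(y) if label == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    keep = sorted(minority + rng.sample(majority, len(minority)))
    return [X[i] for i in keep], [y[i] for i in keep]


def select_features(X, y, k):
    """Assumed ranker: score each feature by the absolute difference between
    its class means and return the indices of the k highest-scoring ones."""
    def mean(values):
        return sum(values) / len(values)

    scores = []
    for j in range(len(X[0])):
        m1 = mean([row[j] for row, label in zip(X, y) if label == 1])
        m0 = mean([row[j] for row, label in zip(X, y) if label == 0])
        scores.append((abs(m1 - m0), j))
    return sorted(j for _, j in sorted(scores, reverse=True)[:k])


def build_scenarios(X, y, k, seed=0):
    """Return the training set for each of the four scenarios, numbered
    1 to 4 in the same order as the abstract."""
    Xs, ys = undersample(X, y, seed)
    data = {"original": (X, y), "sampled": (Xs, ys)}
    plans = {}
    # product() yields (original, original), (original, sampled),
    # (sampled, original), (sampled, sampled): scenarios 1 through 4.
    for number, (fs_src, model_src) in enumerate(product(data, repeat=2), 1):
        features = select_features(*data[fs_src], k)
        Xm, ym = data[model_src]
        plans[number] = {
            "select_on": fs_src,
            "model_on": model_src,
            "features": features,
            "train_X": [[row[j] for j in features] for row in Xm],
            "train_y": ym,
        }
    return plans
```

Each entry in the returned dictionary can then be passed to any classifier; comparing the resulting models on held-out data mirrors the comparison the abstract describes, though the paper's actual learners and evaluation protocol are not given in this record.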