An Empirical Study on the Stability of Feature Selection for Imbalanced Software Engineering Data

Author

Wang, Huanjing; Khoshgoftaar, Taghi M.; Napolitano, Antonio

Volume

1

fYear

2012

fDate

12-15 Dec. 2012

Firstpage

317

Lastpage

323

Abstract

In software quality modeling, software metrics are collected during the software development cycle. However, not all metrics are relevant to the class attribute (software quality). Metric (feature) selection has become the cornerstone of many software quality classification problems. Selecting software metrics that are important for software quality classification is a necessary and critical step before the model training process. Recently, the robustness (e.g., stability) of feature selection techniques has been studied to examine the sensitivity of these techniques to changes (adding/removing program modules to/from their dataset). This work provides an empirical study of the stability of feature selection techniques across six software metrics datasets with varying levels of class balance. In this work, eighteen feature selection techniques are evaluated. Moreover, three factors (feature subset size, degree of perturbation, and class balance of the datasets) are considered in this study to evaluate the stability of feature selection techniques. Experimental results show that these factors affect the stability of feature selection techniques as one might expect. We found that, with few exceptions, feature rankings based on highly imbalanced datasets are less stable than those based on slightly imbalanced data. Results also show that making smaller changes to the datasets has less impact on the stability of feature ranking techniques. Overall, we conclude that a careful understanding of one's dataset (and certain choices of metric selection technique) can help practitioners build more reliable software quality models.

Keywords

pattern classification; software metrics; software quality; feature ranking; feature selection stability; imbalanced software engineering data; model training process; software development cycle; software quality classification problems; software quality modeling; Indexes; Measurement; Radio frequency; Stability criteria; imbalanced data; stability; subsample

fLanguage

English

Publisher

IEEE

Conference_Titel

Machine Learning and Applications (ICMLA), 2012 11th International Conference on

Conference_Location

Boca Raton, FL

Print_ISBN

978-1-4673-4651-1

Type

conf

DOI

10.1109/ICMLA.2012.60

Filename

6406682