Title of article :
An empirical study of the classification performance of learners on imbalanced and noisy software quality data
Author/Authors :
Chris Seiffert, Taghi M. Khoshgoftaar, Jason Van Hulse, Andres Folleco
Issue Information :
Journal issue with serial number, year 2014
Pages :
25
From page :
571
To page :
595
Abstract :
Data mining techniques are commonly used to construct models for identifying software modules that are most likely to contain faults. In doing so, an organization’s limited resources can be intelligently allocated with the goal of detecting and correcting the greatest number of faults. However, there are two characteristics of software quality datasets that can negatively impact the effectiveness of these models: class imbalance and class noise. Software quality datasets are, by their nature, imbalanced. That is, most of a software system’s faults can be found in a small percentage of software modules. Therefore, the number of fault-prone, fp, examples (program modules) in a software project dataset is much smaller than the number of not fault-prone, nfp, examples. Data sampling techniques attempt to alleviate the problem of class imbalance by altering a training dataset’s distribution. A program module contains class noise if it is incorrectly labeled. While several studies have been performed to evaluate data sampling methods, the impact of class noise on these techniques has not been adequately addressed. This work presents a systematic set of experiments designed to investigate the impact of both class noise and class imbalance on classification models constructed to identify fault-prone program modules. We analyze the impact of class noise and class imbalance on 11 different learning algorithms (learners) as well as 7 different data sampling techniques. We identify which learners and which data sampling techniques are most robust when confronted with noisy and imbalanced data.
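The two data characteristics named in the abstract can be illustrated with a minimal sketch. It assumes random undersampling as the sampling technique and random label flipping as the noise model; the abstract does not name the 7 sampling techniques or the noise procedure the study uses, so both are stand-ins, and the function names and toy dataset are hypothetical.

```python
import random

def random_undersample(examples, majority_label="nfp", minority_label="fp", seed=0):
    """Drop majority-class examples at random until both classes are the same size.

    Assumption: random undersampling stands in for the study's sampling techniques.
    """
    rng = random.Random(seed)
    minority = [e for e in examples if e[1] == minority_label]
    majority = [e for e in examples if e[1] == majority_label]
    kept = rng.sample(majority, min(len(minority), len(majority)))
    return minority + kept

def inject_class_noise(examples, noise_rate, seed=0):
    """Flip the label of a randomly chosen fraction of examples.

    Assumption: uniform label flipping stands in for the study's noise model.
    """
    rng = random.Random(seed)
    n_flip = round(noise_rate * len(examples))
    flip_idx = set(rng.sample(range(len(examples)), n_flip))
    flipped = {"fp": "nfp", "nfp": "fp"}
    return [
        (x, flipped[y]) if i in flip_idx else (x, y)
        for i, (x, y) in enumerate(examples)
    ]

# Hypothetical toy dataset: 10 fault-prone and 90 not fault-prone modules,
# each represented only by a module id.
data = [(i, "fp") for i in range(10)] + [(i, "nfp") for i in range(10, 100)]

balanced = random_undersample(data)            # altered training distribution
noisy = inject_class_noise(data, noise_rate=0.1)  # mislabeled modules
```

Here `balanced` contains equal numbers of fp and nfp examples, while `noisy` keeps the original class sizes but mislabels a tenth of the modules. A training set in such an experiment would typically be noised first and then sampled.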
Keywords :
Binary classification, Class noise, Imbalance, Sampling
Journal title :
Information Sciences
Serial Year :
2014
Record number :
1216007