DocumentCode :
231259
Title :
A Two-Stage Data Preprocessing Approach for Software Fault Prediction
Author :
Jiaqiang Chen ; Shulong Liu ; Wangshu Liu ; Xiang Chen ; Qing Gu ; Daoxu Chen
Author_Institution :
State Key Lab. for Novel Software Technol., Nanjing Univ., Nanjing, China
fYear :
2014
fDate :
June 30 2014-July 2 2014
Firstpage :
20
Lastpage :
29
Abstract :
Software fault prediction estimates the fault proneness of software modules so that limited test resources can be allocated effectively for software quality assurance. Researchers have shown that either feature selection or instance reduction can improve the performance of classification models used for fault prediction. However, to the best of our knowledge, few researchers have combined the two to study their joint effect on classification models. We therefore propose a novel two-stage data preprocessing approach that incorporates both feature selection and instance reduction. In the feature selection stage, we propose a new algorithm that combines feature selection with threshold-based clustering and performs both relevance analysis and redundancy control. In the instance reduction stage, we apply random sampling to balance the faulty and non-faulty classes. In empirical studies, we implemented five data preprocessing schemes based on the proposed approach and compared the prediction performance of commonly used classification models. The results demonstrate the effectiveness of our approach and provide a guideline for cost-effective data preprocessing when using it.
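The two stages described in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: the abstract does not specify the relevance measure, the clustering rule, or the sampling scheme, so the sketch assumes absolute Pearson correlation for both relevance and redundancy, a greedy threshold rule that keeps the most relevant feature of each group of highly correlated features, and random undersampling of the majority class. All function names, the `threshold` parameter, and the `seed` parameter are hypothetical.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def select_features(rows, labels, threshold=0.8):
    """Stage 1 (assumed form): relevance ranking plus redundancy control.

    Features are visited from most to least relevant. A feature is kept
    only if its absolute correlation with every already-kept feature is
    below `threshold`, so each kept feature represents one cluster.
    """
    n_features = len(rows[0])
    columns = [[r[j] for r in rows] for j in range(n_features)]
    relevance = [abs(pearson(col, labels)) for col in columns]
    order = sorted(range(n_features), key=lambda j: relevance[j], reverse=True)
    selected = []
    for j in order:
        redundant = any(
            abs(pearson(columns[j], columns[k])) >= threshold for k in selected
        )
        if not redundant:
            selected.append(j)
    return selected


def balance_by_undersampling(rows, labels, seed=0):
    """Stage 2 (assumed form): randomly drop majority-class instances
    until faulty (1) and non-faulty (0) classes have equal size."""
    rng = random.Random(seed)
    faulty = [i for i, y in enumerate(labels) if y == 1]
    clean = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = sorted([faulty, clean], key=len)
    kept = sorted(minority + rng.sample(majority, len(minority)))
    return [rows[i] for i in kept], [labels[i] for i in kept]


def preprocess(rows, labels, threshold=0.8, seed=0):
    """Run feature selection first, then instance reduction."""
    selected = select_features(rows, labels, threshold)
    reduced = [[r[j] for j in selected] for r in rows]
    new_rows, new_labels = balance_by_undersampling(reduced, labels, seed)
    return selected, new_rows, new_labels
```

Calling `preprocess(rows, labels)` on a list of numeric feature rows and a list of 0/1 labels returns the indices of the kept features together with the balanced training set. Swapping in a different relevance measure or sampling strategy only requires changing `pearson` or `balance_by_undersampling`.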
Keywords :
feature selection; pattern classification; pattern clustering; program testing; random processes; sampling methods; software fault tolerance; software quality; classification models; fault proneness prediction; feature selection; instance reduction; nonfaulty classes; prediction performance; random sampling; redundancy control; relevance analysis; software fault prediction; software modules; software quality assurance; test resources; threshold-based clustering; two-stage data preprocessing approach; Algorithm design and analysis; Clustering algorithms; Data models; Data preprocessing; Redundancy; Software; Training data; Feature Selection; Instance Reduction; Redundancy Control; Relevance Analysis; Software Fault Prediction;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Software Security and Reliability (SERE), 2014 Eighth International Conference on
Conference_Location :
San Francisco, CA
Print_ISBN :
978-1-4799-4296-1
Type :
conf
DOI :
10.1109/SERE.2014.15
Filename :
6895412