DocumentCode :
2548793
Title :
Comparison of various methods for handling incomplete data in software engineering databases
Author :
Twala, Bhekisipho ; Cartwright, Michelle ; Shepperd, Martin
Author_Institution :
Brunel Univ., Uxbridge, UK
Year :
2005
Date :
17-18 Nov. 2005
Abstract :
Increasing the awareness of how missing data affects software predictive accuracy has led to increasing numbers of missing data techniques (MDTs). This paper investigates the robustness and accuracy of eight popular techniques for tolerating incomplete training and test data using tree-based models. MDTs were compared by artificially simulating different proportions, patterns, and mechanisms of missing data. A 4-way repeated measures design was employed to analyze the data. The simulation results suggest important differences. Listwise deletion is substantially inferior while multiple imputation (MI) represents a superior approach to handling missing data. Decision tree single imputation and surrogate variables splitting are more severely impacted by missing values distributed among all attributes. MI should be used if the data contain many missing values. If few values are missing, any of the MDTs might be considered. Choice of technique should be guided by pattern and mechanisms of missing data.
Keywords :
data handling; database management systems; decision trees; software fault tolerance; software performance evaluation; decision tree; imputation representation; missing data technique; software engineering database; software predictive accuracy; tree-based model; Accuracy; Data analysis; Databases; Decision trees; Machine learning; Machine learning algorithms; Robustness; Software engineering; Software quality; Testing
Language :
English
Publisher :
IEEE
Conference_Title :
Empirical Software Engineering, 2005. 2005 International Symposium on
Print_ISBN :
0-7803-9507-7
Type :
conf
DOI :
10.1109/ISESE.2005.1541819
Filename :
1541819