Title :
Reflections on the NASA MDP data sets
Author :
Gray, D.; Bowes, D.; Davey, Neil; Sun, Yue; Christianson, Bruce
Author_Institution :
Comput. Sci. Dept., Univ. of Hertfordshire, Hatfield, UK
Abstract :
Background: The NASA Metrics Data Program (MDP) data sets have been heavily used in software defect prediction research. Aim: To highlight the data quality issues present in these data sets, and the problems that can arise when they are used in a binary classification context. Method: A thorough exploration of all 13 original NASA data sets, followed by various experiments demonstrating the potential impact of duplicate data points when data mining. Conclusions: Firstly, researchers need to analyse the data that forms the basis of their findings in the context of how it will be used. Secondly, the bulk of defect prediction experiments based on the NASA MDP data sets may have led to erroneous findings. This is mainly because repeated/duplicate data points can cause substantial amounts of training and testing data to be identical.
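The concern about identical training and testing data can be illustrated with a minimal Python sketch. It is not the authors' experimental procedure: the function names `split_overlap` and `drop_duplicates` are hypothetical, each row is assumed to be a tuple of metric values plus a class label, and the toy rows in the usage note are invented rather than taken from the NASA MDP data.

```python
import random


def split_overlap(rows, test_fraction=0.25, seed=0):
    """Fraction of test rows that also occur, value for value, in the training split.

    Assumption: each row is a hashable tuple (metric values plus class label).
    """
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)

    # Simple random hold-out split: first n_test rows are test, the rest train.
    n_test = int(len(shuffled) * test_fraction)
    test, train = shuffled[:n_test], shuffled[n_test:]
    if not test:
        return 0.0

    train_set = set(train)
    shared = sum(1 for row in test if row in train_set)
    return shared / len(test)


def drop_duplicates(rows):
    """Keep the first occurrence of each row, preserving the original order."""
    return list(dict.fromkeys(rows))
```

For example, with eight identical toy rows, `split_overlap([(1, 2, 0)] * 8)` reports that every test row also appears in the training split, whereas any list passed through `drop_duplicates` first yields an overlap of zero. A classifier evaluated on the former split is partly being tested on data it has already seen.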
Keywords :
data mining; pattern classification; software metrics; software reliability; NASA MDP data set; National Aeronautics and Space Administration; binary classification context; data quality issue; defect prediction experiment; duplicate data point; metrics data program; repeated data point; software defect prediction research
Journal_Title :
Software, IET
DOI :
10.1049/iet-sen.2011.0132