DocumentCode
2006163
Title
Comparison of Four Performance Metrics for Evaluating Sampling Techniques for Low Quality Class-Imbalanced Data
Author
Folleco, Andres ; Khoshgoftaar, Taghi M. ; Napolitano, Amri
Author_Institution
Florida Atlantic Univ., Boca Raton, FL
fYear
2008
fDate
11-13 Dec. 2008
Firstpage
153
Lastpage
158
Abstract
Erroneous attribute values can significantly impact learning from otherwise valuable data, and this impact can be exacerbated when the training data are class imbalanced. We investigate and compare the overall learning impact of sampling such data, using four distinct performance metrics suitable for models built from binary class-imbalanced data. Seven class-imbalanced software engineering measurement datasets, each relatively free of noise, were used. A novel noise injection procedure was applied to these datasets: domain-realistic noise was injected into the independent and dependent (class) attributes of randomly selected instances to simulate lower quality measurement data. Seven well-known data sampling techniques were used with the benchmark decision-tree learner C4.5. No other related studies were found that comprehensively investigate learning by sampling low quality binary class-imbalanced data in which both independent and dependent attributes are corrupted. Two sampling techniques (random undersampling and Wilson's editing) were identified as having better and more robust learning performance. In contrast, all metrics concurred in identifying the worst performing sampling technique (cluster-based oversampling).
Keywords
data analysis; decision trees; learning (artificial intelligence); software engineering; erroneous attribute values; imbalanced software engineering measurement; impact learning; low quality class-imbalanced data; noise injection procedure; performance metrics; quality binary class imbalanced data; sampling techniques; Decision trees; Design for experiments; Machine learning; Noise measurement; Noise robustness; Sampling methods; Software engineering; Software measurement; Software quality; Training data; class imbalance; data quality; data sampling; performance metrics; simulated noise
fLanguage
English
Publisher
ieee
Conference_Title
Machine Learning and Applications, 2008. ICMLA '08. Seventh International Conference on
Conference_Location
San Diego, CA
Print_ISBN
978-0-7695-3495-4
Type
conf
DOI
10.1109/ICMLA.2008.11
Filename
4724969
Link To Document