DocumentCode :
2006163
Title :
Comparison of Four Performance Metrics for Evaluating Sampling Techniques for Low Quality Class-Imbalanced Data
Author :
Folleco, Andres ; Khoshgoftaar, Taghi M. ; Napolitano, Amri
Author_Institution :
Florida Atlantic Univ., Boca Raton, FL
fYear :
2008
fDate :
11-13 Dec. 2008
Firstpage :
153
Lastpage :
158
Abstract :
Erroneous attribute values can significantly impair learning from otherwise valuable data, and this impact can be exacerbated by class-imbalanced training data. We investigate and compare the overall impact of sampling such data on learning, using four distinct performance metrics suitable for models built from binary class-imbalanced data. Seven class-imbalanced software engineering measurement datasets, each relatively free of noise, were used. A novel noise injection procedure was applied to these datasets: domain-realistic noise was injected into the independent and dependent (class) attributes of randomly selected instances to simulate lower-quality measurement data. Seven well-known data sampling techniques were used with the benchmark decision-tree learner C4.5. No other related studies were found that comprehensively investigate learning by sampling low-quality binary class-imbalanced data in which both independent and dependent attributes are corrupted. Two sampling techniques, random undersampling and Wilson's editing, were identified as having better and more robust learning performance. In contrast, all metrics concurred in identifying cluster-based oversampling as the worst-performing sampling technique.
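One of the two techniques the abstract identifies as most robust, random undersampling, can be sketched as follows. This is a minimal illustration of the general technique, not the authors' implementation; the function name and the balance-to-equal-sizes policy are assumptions.

```python
import random

def random_undersample(instances, labels, seed=0):
    """Randomly discard majority-class instances until both classes
    are the same size (a common variant of random undersampling).

    This is an illustrative sketch; the paper's exact sampling
    ratios and implementation are not specified here.
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    # Keep all minority instances; sample an equal number of majority ones.
    kept_majority = rng.sample(majority, len(minority))
    keep = sorted(minority + kept_majority)
    return [instances[i] for i in keep], [labels[i] for i in keep]
```

For example, undersampling a dataset with 2 positive and 8 negative instances yields 4 instances, 2 from each class.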
Keywords :
data analysis; decision trees; learning (artificial intelligence); software engineering; erroneous attribute values; imbalanced software engineering measurement; impact learning; low quality class-imbalanced data; noise injection procedure; performance metrics; quality binary class imbalanced data; sampling techniques; Decision trees; Design for experiments; Machine learning; Noise measurement; Noise robustness; Sampling methods; Software engineering; Software measurement; Software quality; Training data; class imbalance; data quality; data sampling; performance metrics; simulated noise;
fLanguage :
English
Publisher :
ieee
Conference_Titel :
Seventh International Conference on Machine Learning and Applications (ICMLA '08), 2008
Conference_Location :
San Diego, CA
Print_ISBN :
978-0-7695-3495-4
Type :
conf
DOI :
10.1109/ICMLA.2008.11
Filename :
4724969