DocumentCode :
244728
Title :
A new sampling approach for classification of imbalanced data sets with high density
Author :
Jia Pengfei ; Zhang Chunkai ; He Zhenyu
Author_Institution :
Shenzhen Grad. Sch., Harbin Inst. of Technol., Shenzhen, China
fYear :
2014
fDate :
15-17 Jan. 2014
Firstpage :
217
Lastpage :
222
Abstract :
Class imbalance is a common problem in machine learning. Because traditional classification algorithms are designed for balanced data, they typically perform poorly on imbalanced data, especially imbalanced data with very high density. This paper first introduces the importance of imbalanced data classification in various fields, then discusses existing methods for solving the imbalanced data classification problem, and finally proposes two new sampling methods, based on Borderline-SMOTE, for imbalanced data with high density, especially big data with this kind of distribution. The two new algorithms not only over-sample the minority samples near the borderline, but also create appropriate synthetic samples on the majority class side and under-sample particular majority class samples. Experiments show that the two algorithms achieve better performance than random over-sampling, SMOTE (Synthetic Minority Over-sampling Technique), and Borderline-SMOTE, as measured by AUC (Area Under the Receiver Operating Characteristic Curve), when the sampling rate brings the majority and minority classes approximately into balance.
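The abstract does not spell out the two proposed algorithms, so the following is only a minimal sketch of the baseline idea they build on: Borderline-SMOTE style over-sampling of minority samples near the class boundary. It is not the authors' method. The function name `borderline_oversample`, its parameters (`k`, `n_new`, `seed`), and the toy data are all assumptions made for illustration.

```python
import math
import random


def borderline_oversample(minority, majority, k=3, n_new=4, seed=0):
    """Sketch of Borderline-SMOTE style over-sampling (illustrative only).

    Returns the indices of borderline ("danger") minority samples and a
    list of synthetic minority points interpolated from them.
    """
    rng = random.Random(seed)

    # Step 1: find minority samples whose k nearest neighbours are mostly,
    # but not entirely, majority samples. These sit near the borderline.
    danger = []
    for i, p in enumerate(minority):
        labelled = [(math.dist(p, q), 1) for j, q in enumerate(minority) if j != i]
        labelled += [(math.dist(p, q), 0) for q in majority]
        nearest = sorted(labelled)[:k]
        n_major = sum(1 for _, label in nearest if label == 0)
        if k / 2 <= n_major < k:
            danger.append(i)

    # Step 2: for each new sample, pick a danger point, pick one of its
    # k nearest minority neighbours, and interpolate between the two.
    synthetic = []
    for _ in range(n_new if danger else 0):
        i = rng.choice(danger)
        p = minority[i]
        peers = sorted(
            (q for j, q in enumerate(minority) if j != i),
            key=lambda q: math.dist(p, q),
        )[:k]
        if not peers:
            continue  # a lone minority sample has nothing to interpolate with
        q = rng.choice(peers)
        gap = rng.random()
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(p, q)))
    return danger, synthetic


# Hypothetical toy data: two minority points sit right next to the majority cluster.
minority = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (1.0, 0.9)]
majority = [(1.1, 1.0), (1.0, 1.1), (1.2, 1.2), (3.0, 3.0)]
danger, synthetic = borderline_oversample(minority, majority)
print(danger, synthetic)
```

The sketch covers only the over-sampling step. According to the abstract, the proposed methods additionally create synthetic samples on the majority class side and under-sample particular majority samples, and those steps are not reproduced here because the record gives no details.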
Keywords :
learning (artificial intelligence); pattern classification; sampling methods; AUC metric evaluate method; Borderline-SMOTE; area under receiver operating characteristics curve; big data; distribution feature; imbalanced data classification issues; machine learning; majority class samples side; sampling approach; Breast; Classification algorithms; Distributed databases; Information management; Prediction algorithms; Sampling methods; Training; big data; classification; high density; imbalanced data; sampling method;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Big Data and Smart Computing (BIGCOMP), 2014 International Conference on
Conference_Location :
Bangkok
Type :
conf
DOI :
10.1109/BIGCOMP.2014.6741439
Filename :
6741439