DocumentCode
244728
Title
A new sampling approach for classification of imbalanced data sets with high density
Author
Jia Pengfei ; Zhang Chunkai ; He Zhenyu
Author_Institution
Shenzhen Grad. Sch., Harbin Inst. of Technol., Shenzhen, China
fYear
2014
fDate
15-17 Jan. 2014
Firstpage
217
Lastpage
222
Abstract
Class imbalance is a common problem in machine learning. Because traditional classification algorithms are designed for balanced data, they often perform poorly on imbalanced data, especially imbalanced data with very high density. This paper first introduces the importance of imbalanced data classification in various fields, then discusses existing methods for the imbalanced data classification problem, and finally proposes two new sampling methods, based on Borderline-SMOTE, for imbalanced data with high density, especially big data with this kind of distribution. The two new algorithms not only over-sample the minority samples near the borderline, but also create appropriate synthetic samples on the majority-class side and under-sample certain majority-class samples. Experiments show that, when the sampling rate brings the majority and minority classes to approximate balance, the two algorithms can achieve better performance than random over-sampling, SMOTE (Synthetic Minority Over-sampling Technique), and Borderline-SMOTE as measured by AUC (Area Under the Receiver Operating Characteristic Curve).
Keywords
learning (artificial intelligence); pattern classification; sampling methods; AUC metric evaluate method; Borderline-SMOTE; area under receiver operating characteristics curve; big data; distribution feature; imbalanced data classification issues; machine learning; majority class samples side; sampling approach; Breast; Classification algorithms; Distributed databases; Information management; Prediction algorithms; Sampling methods; Training; big data; classification; high density; imbalanced data; sampling method;
fLanguage
English
Publisher
IEEE
Conference_Titel
Big Data and Smart Computing (BIGCOMP), 2014 International Conference on
Conference_Location
Bangkok
Type
conf
DOI
10.1109/BIGCOMP.2014.6741439
Filename
6741439