A new sampling approach for classification of imbalanced data sets with high density

Author

Jia Pengfei ; Zhang Chunkai ; He Zhenyu

Author_Institution

Shenzhen Grad. Sch., Harbin Inst. of Technol., Shenzhen, China

fYear

2014

fDate

15-17 Jan. 2014

Firstpage

217

Lastpage

222

Abstract

Class imbalance in datasets is a common problem in machine learning. Because traditional classification algorithms are designed for balanced cases, they typically perform poorly on imbalanced data, especially imbalanced data with very high density. This paper first introduces the importance of imbalanced data classification in various fields, then discusses existing methods for solving the imbalanced data classification problem, and finally proposes two new sampling methods, based on Borderline-SMOTE, for imbalanced data with high density, especially big data with this kind of distribution. The two new algorithms not only over-sample the minority samples near the borderline, but also create appropriate synthetic samples on the majority class side and under-sample particular majority class samples. Experiments show that the two algorithms can achieve better performance than random over-sampling, SMOTE (Synthetic Minority Over-sampling Technique), and Borderline-SMOTE under the AUC (Area Under the Receiver Operating Characteristic Curve) evaluation metric when the sampling rate brings the majority and minority classes to approximate balance.
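For orientation, the following is a minimal sketch of the baseline Borderline-SMOTE idea that the abstract names as the starting point: minority samples whose neighbourhood is dominated (but not entirely filled) by majority samples are treated as borderline, and synthetic points are interpolated between them and nearby minority samples. It is not the two methods proposed in the paper, whose details the abstract does not give. The function name `borderline_smote_sketch`, its parameters (`m`, `k`, `n_new`, `seed`), and the toy data are all assumptions made for illustration.

```python
import numpy as np

def borderline_smote_sketch(X, y, minority=1, m=5, k=5, n_new=None, seed=0):
    """Return synthetic minority samples built from borderline minority points.

    Hypothetical helper: m = neighbours used to detect borderline points,
    k = minority neighbours used for interpolation.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    min_idx = np.flatnonzero(y == minority)
    X_min = X[min_idx]
    if len(X_min) < 2:
        return np.empty((0, X.shape[1]))

    # Distances from each minority sample to every sample, excluding itself.
    d_all = np.linalg.norm(X_min[:, None, :] - X[None, :, :], axis=2)
    d_all[np.arange(len(min_idx)), min_idx] = np.inf
    nn_all = np.argsort(d_all, axis=1)[:, :m]

    # Borderline ("danger") points: at least half, but not all, of the
    # m nearest neighbours belong to the majority class.
    n_maj = (y[nn_all] != minority).sum(axis=1)
    danger_idx = np.flatnonzero((n_maj >= m / 2) & (n_maj < m))
    if len(danger_idx) == 0:
        return np.empty((0, X.shape[1]))
    X_danger = X_min[danger_idx]

    # Nearest minority neighbours of each borderline point, excluding itself.
    d_min = np.linalg.norm(X_danger[:, None, :] - X_min[None, :, :], axis=2)
    d_min[np.arange(len(danger_idx)), danger_idx] = np.inf
    k_eff = min(k, len(X_min) - 1)
    nn_min = np.argsort(d_min, axis=1)[:, :k_eff]

    # By default, generate enough points to roughly balance the two classes.
    if n_new is None:
        n_new = max(int((y != minority).sum()) - len(X_min), 0)

    # Interpolate: synthetic = base + gap * (neighbour - base), gap in [0, 1).
    base = rng.integers(0, len(X_danger), size=n_new)
    nb = nn_min[base, rng.integers(0, k_eff, size=n_new)]
    gap = rng.random((n_new, 1))
    return X_danger[base] + gap * (X_min[nb] - X_danger[base])

# Toy data (made up): three minority points next to six majority points.
X_demo = np.array([[0, 0], [1, 0], [2, 0],
                   [2.4, 0], [2.9, 0], [3.5, 0], [4, 0], [5, 0], [6, 0]])
y_demo = np.array([1, 1, 1, 0, 0, 0, 0, 0, 0])
synthetic = borderline_smote_sketch(X_demo, y_demo, m=3, k=2)
print(synthetic.shape)
```

In this toy set only the minority point at (2, 0) is borderline, so every synthetic point lies on the segment between it and one of the other two minority points. Under-sampling of majority samples and generation on the majority side, which the abstract attributes to the proposed methods, are deliberately left out.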

Keywords

learning (artificial intelligence); pattern classification; sampling methods; AUC metric evaluate method; Borderline-SMOTE; area under receiver operating characteristics curve; big data; distribution feature; imbalanced data classification issues; machine learning; majority class samples side; sampling approach; Breast; Classification algorithms; Distributed databases; Information management; Prediction algorithms; Sampling methods; Training; big data; classification; high density; imbalanced data; sampling method;

fLanguage

English

Publisher

IEEE

Conference_Titel

Big Data and Smart Computing (BIGCOMP), 2014 International Conference on

Conference_Location

Bangkok

Type

conf

DOI

10.1109/BIGCOMP.2014.6741439

Filename

6741439