Title :
Cluster-based majority under-sampling approaches for class imbalance learning
Author :
Zhang, Yan-ping ; Zhang, Li-Na ; Wang, Yong-Cheng
Author_Institution :
Sch. of Comput. Sci. & Technol., Anhui Univ., Hefei, China
Abstract :
The class imbalance problem is common in real applications: one class in the training set has far fewer examples than another. Under-sampling is a popular way to deal with this problem. It is efficient because it uses only a subset of the majority class, but it discards many potentially useful majority class examples. To overcome this drawback, we apply an unsupervised learning technique to supervised learning. We propose cluster-based majority under-sampling approaches that select a representative subset from the majority class. Compared with random under-sampling, cluster-based under-sampling effectively avoids losing important majority class information. We use two methods to select the representative subset from k clusters in certain proportions, and then train on the representative subset together with all minority class samples to improve accuracy on both the minority and majority classes. We compare our approaches with the traditional random under-sampling approach on ten UCI repository datasets using two classifiers: k-nearest neighbor and Naïve Bayes. Recall, Precision, F-measure, G-mean and BACC (balanced accuracy) are used to evaluate classifier performance. Experimental results show that our cluster-based majority under-sampling approaches outperform the random under-sampling approach. Our approaches attain better overall performance with the k-nearest neighbor classifier than with the Naïve Bayes classifier.
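The pipeline described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not name the clustering algorithm or the two selection methods, so the sketch assumes plain k-means and a per-cluster quota proportional to cluster size. The function names `kmeans`, `cluster_undersample` and `g_mean_and_bacc` are hypothetical. G-mean and balanced accuracy use their standard definitions, with the minority class treated as positive.

```python
import math
import random


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means (assumed clusterer); returns a label per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign each point to its nearest center (Euclidean distance).
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                  for p in points]
        # Move each center to the mean of its members; skip empty clusters.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels


def cluster_undersample(majority, n_keep, k=3, seed=0):
    """Keep roughly n_keep majority points, sampled from each cluster in
    proportion to its size (an assumed selection rule, not the paper's)."""
    rng = random.Random(seed)
    labels = kmeans(majority, k, seed=seed)
    kept = []
    for c in range(k):
        members = [p for p, l in zip(majority, labels) if l == c]
        quota = round(n_keep * len(members) / len(majority))
        kept.extend(rng.sample(members, min(quota, len(members))))
    return kept


def g_mean_and_bacc(tp, fn, tn, fp):
    """G-mean and balanced accuracy from confusion-matrix counts."""
    tpr = tp / (tp + fn)  # recall on the minority (positive) class
    tnr = tn / (tn + fp)  # recall on the majority (negative) class
    return math.sqrt(tpr * tnr), (tpr + tnr) / 2


# Synthetic data for illustration only: three majority blobs, one small minority blob.
data_rng = random.Random(1)
majority = [(data_rng.gauss(cx, 0.5), data_rng.gauss(cy, 0.5))
            for cx, cy in [(0, 0), (5, 5), (0, 5)] for _ in range(40)]
minority = [(data_rng.gauss(2.5, 0.5), data_rng.gauss(2.5, 0.5))
            for _ in range(12)]

# Representative majority subset plus all minority samples form the training set.
subset = cluster_undersample(majority, n_keep=len(minority), k=3)
training = [(p, 0) for p in subset] + [(p, 1) for p in minority]
```

Because each cluster's quota is rounded independently, the subset size can differ from `n_keep` by a sample or so. The resulting `training` list could then be passed to any classifier, such as the k-nearest neighbor or Naïve Bayes classifiers named in the abstract.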
Keywords :
Bayes methods; pattern classification; pattern clustering; sampling methods; unsupervised learning; F-measure; G-mean; Naive Bayes classifier; class imbalance learning; cluster based under sampling approach; k-nearest neighbor; supervised learning; unsupervised learning technique; Accuracy; Classification algorithms; Conferences; Data mining; Learning; Machine learning; Training; classification; clustering; under-sampling
Conference_Title :
Information and Financial Engineering (ICIFE), 2010 2nd IEEE International Conference on
Conference_Location :
Chongqing
Print_ISBN :
978-1-4244-6927-7
DOI :
10.1109/ICIFE.2010.5609385