DocumentCode
179732
Title
Under-sampling by algorithm with performance guaranteed for class-imbalance problem
Author
Jindaluang, Wattana ; Chouvatut, Varin ; Kantabutra, Sanpawat
Author_Institution
Dept. of Comput. Sci., Chiang Mai Univ., Chiang Mai, Thailand
fYear
2014
fDate
July 30 2014-Aug. 1 2014
Firstpage
215
Lastpage
221
Abstract
The class-imbalance problem arises when the number of data in the majority class is much larger than in the minority class. Traditional classifiers handle this problem poorly because they focus more on the data in the majority class than on the data in the minority class, and so they tend to predict incoming data as belonging to the majority class. Under-sampling is an efficient way to handle this problem because it selects only representatives of the data in the majority class. For this reason, under-sampling requires a shorter training period than over-sampling. The main drawback of under-sampling is that selecting representatives is likely to discard important information in the majority class. To overcome this drawback, we propose a cluster-based under-sampling method. We use a clustering algorithm with a performance guarantee, the k-centers algorithm, to cluster the data in the majority class and select representative data in several proportions, and then combine them with all the data in the minority class to form a training set. In this paper, we compare our approach with k-means on five datasets from UCI using two classifiers: 5-nearest neighbors and the C4.5 decision tree. Performance is measured by Precision, Recall, F-measure, and Accuracy. The experimental results show that our approach scores higher than the k-means approach on every measure except Precision, on which the two approaches have the same rate.
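The under-sampling step described in the abstract can be illustrated with a minimal sketch. The abstract does not say which k-centers variant the authors use, so this sketch assumes the greedy farthest-first traversal, a common k-centers heuristic with a known factor-2 approximation guarantee. The function names `k_centers` and `under_sample`, the `first` parameter, and the 0/1 class labels are hypothetical and chosen only for illustration.

```python
import math


def k_centers(points, k, first=0):
    """Return indices of k representatives chosen by greedy farthest-first
    traversal (an ASSUMED k-centers variant, not confirmed by the abstract)."""
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and len(points)")
    centers = [first]
    # dist[i] holds the distance from point i to its nearest chosen center.
    dist = [math.dist(p, points[first]) for p in points]
    while len(centers) < k:
        # The next center is the point farthest from all current centers.
        nxt = max(range(len(points)), key=dist.__getitem__)
        centers.append(nxt)
        dist = [min(d, math.dist(p, points[nxt])) for d, p in zip(dist, points)]
    return centers


def under_sample(majority, minority, k):
    """Keep k majority representatives and every minority point.
    Labels are hypothetical: 0 = majority class, 1 = minority class."""
    reps = [majority[i] for i in k_centers(majority, k)]
    X = reps + list(minority)
    y = [0] * len(reps) + [1] * len(minority)
    return X, y


# Toy data, invented for illustration only.
majority = [(0, 0), (0, 1), (10, 0), (10, 1), (5, 5)]
minority = [(20, 20), (21, 21)]
X, y = under_sample(majority, minority, k=2)
```

Choosing `k` as a multiple of the minority-class size would be one way to obtain the different proportions the abstract mentions; that mapping is an assumption, not something the abstract states.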
Keywords
decision trees; pattern classification; pattern clustering; sampling methods; 5-nearest neighbor classifier; UCI; c4.5 decision tree classifier; class-imbalance problem; cluster-based under-sampling method; clustering algorithm; k-centers algorithm; majority class; minority class; over-sampling method; Accuracy; Classification algorithms; Clustering algorithms; Computer science; Decision trees; Sampling methods; Training; class-imbalance problem; classification; k-centers algorithm; under-sampling;
fLanguage
English
Publisher
IEEE
Conference_Titel
Computer Science and Engineering Conference (ICSEC), 2014 International
Conference_Location
Khon Kaen
Print_ISBN
978-1-4799-4965-6
Type
conf
DOI
10.1109/ICSEC.2014.6978197
Filename
6978197