Title :
Learning from combination of data chunks for multi-class imbalanced data
Author :
Xu-Ying Liu ; Qian-Qian Li
Author_Institution :
Key Lab. of Comput. Network & Inf. Integration, Southeast Univ., Nanjing, China
Abstract :
Class imbalance is very common in real-world applications. Previous studies have focused on the binary-class imbalance problem, whereas the multi-class imbalance problem is more general and more challenging. Under-sampling is an effective and efficient method for binary-class imbalanced data, but when it is applied to multi-class imbalanced data, many more majority-class examples are discarded, because there are often multiple majority classes and the minority class often has very few examples. To exploit the information contained in the majority-class examples that under-sampling would discard, this paper proposes a method called ChunkCombine. For each majority class, it performs under-sampling multiple times to obtain non-overlapping data chunks, so that together they carry the most information a data sample of the same size can carry. Each data chunk has the same size as the minority class to achieve balance. Every possible combination of the minority class with one data chunk from each majority class then forms a balanced training set, and ChunkCombine uses ensemble techniques to learn from the training sets derived from all such combinations. Experimental results show that it outperforms many other popular methods for multi-class imbalanced data when average accuracy, G-mean and MAUC are used as evaluation measures. In addition, we discuss different evaluation measures and suggest that a multi-class F-measure, Mean F-Measure (MFM), is unsuitable for multi-class imbalanced data in many situations, because it is not consistent with the standard F-measure in the binary-class case and it is close to accuracy.
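The chunking-and-combination procedure described in the abstract can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the function names (`make_chunks`, `chunk_combine_fit`, `chunk_combine_predict`) and the toy nearest-centroid base learner are assumptions made for a self-contained example; the paper's actual base classifiers and voting scheme may differ.

```python
# Hedged sketch of the ChunkCombine idea: each majority class is split into
# non-overlapping chunks of minority-class size; every combination of one
# chunk per majority class plus the minority class forms a balanced training
# set; an ensemble votes over the resulting base classifiers.
from itertools import product

import numpy as np


def make_chunks(X, chunk_size, rng):
    """Shuffle one majority class and split it into non-overlapping chunks."""
    idx = rng.permutation(len(X))
    n_chunks = len(X) // chunk_size
    return [X[idx[i * chunk_size:(i + 1) * chunk_size]] for i in range(n_chunks)]


class NearestCentroid:
    """Tiny stand-in base learner (the paper uses stronger classifiers)."""

    def fit(self, X, y):
        self.labels_ = np.unique(y)
        self.centroids_ = np.array([X[y == c].mean(axis=0) for c in self.labels_])
        return self

    def predict(self, X):
        # Squared Euclidean distance to every class centroid.
        d = ((X[:, None, :] - self.centroids_[None, :, :]) ** 2).sum(axis=2)
        return self.labels_[d.argmin(axis=1)]


def chunk_combine_fit(X, y, minority_label, seed=0):
    """Train one base classifier per combination of majority-class chunks."""
    rng = np.random.default_rng(seed)
    X_min = X[y == minority_label]
    size = len(X_min)  # every chunk matches the minority-class size
    majority_labels = [c for c in np.unique(y) if c != minority_label]
    chunk_lists = [make_chunks(X[y == c], size, rng) for c in majority_labels]
    ensemble = []
    # One balanced training set per combination (Cartesian product of chunks).
    for combo in product(*chunk_lists):
        Xt = np.vstack([X_min] + list(combo))
        yt = np.concatenate([np.full(size, minority_label)]
                            + [np.full(size, c) for c in majority_labels])
        ensemble.append(NearestCentroid().fit(Xt, yt))
    return ensemble


def chunk_combine_predict(ensemble, X):
    """Combine the base classifiers by simple majority vote."""
    votes = np.stack([clf.predict(X) for clf in ensemble])
    return np.array([np.bincount(col).argmax() for col in votes.T])
```

With, say, a minority class of 5 examples and majority classes of 15 and 10 examples, the classes yield 3 and 2 chunks respectively, so the ensemble contains 3 × 2 = 6 base classifiers, each trained on a balanced set of 15 examples.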
Keywords :
learning (artificial intelligence); pattern classification; set theory; ChunkCombine method; G-mean; MAUC; average accuracy; binary-class imbalance problem; multiclass AdaBoost classifiers; multiclass F-measure mean F-measure; multiclass imbalanced data; nonoverlapping data chunks; training sets; under-sampling; Accuracy; Boosting; Educational institutions; Feature extraction; Standards; Training;
Conference_Titel :
2014 International Joint Conference on Neural Networks (IJCNN)
Conference_Location :
Beijing
Print_ISBN :
978-1-4799-6627-1
DOI :
10.1109/IJCNN.2014.6889667