Title :
A Review on Ensembles for the Class Imbalance Problem: Bagging-, Boosting-, and Hybrid-Based Approaches
Author :
Galar, Mikel ; Fernández, Alberto ; Barrenechea, Edurne ; Bustince, Humberto ; Herrera, Francisco
Author_Institution :
Dept. of Autom. y Comput., Univ. Publica de Navarra, Pamplona, Spain
fDate :
7/1/2012
Abstract :
Classifier learning with data-sets that suffer from imbalanced class distributions is a challenging problem in the data mining community. This issue occurs when the number of examples that represent one class is much lower than that of the other classes. Its presence in many real-world applications has drawn growing attention from researchers. In machine learning, ensembles of classifiers are known to increase the accuracy of single classifiers by combining several of them, but neither of these learning techniques alone solves the class imbalance problem; to deal with this issue, ensemble learning algorithms have to be designed specifically. In this paper, our aim is to review the state of the art on ensemble techniques in the framework of imbalanced data-sets, with a focus on two-class problems. We propose a taxonomy for ensemble-based methods that address class imbalance, where each proposal can be categorized depending on the inner ensemble methodology on which it is based. In addition, we develop a thorough empirical comparison of the most significant published approaches, within the families of the proposed taxonomy, to show whether any of them makes a difference. This comparison has shown the good behavior of the simplest approaches, which combine random undersampling techniques with bagging or boosting ensembles. In addition, the positive synergy between sampling techniques and bagging has stood out. Furthermore, our results show empirically that ensemble-based algorithms are worthwhile, since they outperform the mere use of preprocessing techniques before learning the classifier, thereby justifying the increase in complexity by means of a significant enhancement of the results.
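The combination of random undersampling with bagging that the abstract highlights can be illustrated with a minimal sketch. This is not the authors' implementation or any specific algorithm from the review: the function names (`train_centroid`, `under_bagging`), the nearest-centroid base learner, the default of 11 ensemble members, and the toy data are all assumptions chosen for brevity. Each ensemble member is trained on a balanced sample (a bootstrap of the minority class plus an equally sized random subset of the majority class), and predictions are combined by majority vote.

```python
import random
from collections import Counter


def train_centroid(X, y):
    """Hypothetical base learner: nearest-centroid classifier (returns a predict function)."""
    counts = Counter(y)
    sums = {}
    for xi, yi in zip(X, y):
        acc = sums.setdefault(yi, [0.0] * len(xi))
        for j, v in enumerate(xi):
            acc[j] += v
    centroids = {c: [s / counts[c] for s in sums[c]] for c in sums}

    def predict(x):
        # Assign x to the class whose centroid is closest (squared Euclidean distance).
        return min(
            centroids,
            key=lambda c: sum((a - b) ** 2 for a, b in zip(x, centroids[c])),
        )

    return predict


def under_bagging(X, y, n_estimators=11, seed=0):
    """Sketch of bagging with random undersampling of the majority class."""
    rng = random.Random(seed)
    counts = Counter(y)
    minority = min(counts, key=counts.get)
    min_idx = [i for i, yi in enumerate(y) if yi == minority]
    maj_idx = [i for i, yi in enumerate(y) if yi != minority]

    models = []
    for _ in range(n_estimators):
        # Random undersampling: keep only as many majority examples as minority ones.
        sampled_maj = rng.sample(maj_idx, len(min_idx))
        # Bootstrap (sampling with replacement) of the minority class.
        sampled_min = [rng.choice(min_idx) for _ in min_idx]
        idx = sampled_min + sampled_maj
        models.append(train_centroid([X[i] for i in idx], [y[i] for i in idx]))

    def predict(x):
        # Combine the ensemble members by majority vote.
        votes = Counter(m(x) for m in models)
        return votes.most_common(1)[0][0]

    return predict


# Toy two-class data (assumed for illustration): 50 majority vs. 5 minority examples.
X = [((k % 10) * 0.1, (k // 10) * 0.1) for k in range(50)]
X += [(5.0 + k * 0.1, 5.0) for k in range(5)]
y = [0] * 50 + [1] * 5
clf = under_bagging(X, y)
```

Here `clf` is a function that maps a feature tuple to a predicted label. Because every member sees a balanced sample, no single member is dominated by the majority class, while different members see different majority subsets.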
Keywords :
data mining; learning (artificial intelligence); pattern classification; bagging-based approach; boosting-based approach; class imbalance problem; classifier ensemble; classifier learning; data mining community; ensemble learning algorithms; ensemble-based algorithms; ensemble-based method taxonomy; hybrid-based approach; imbalanced data-sets; inner ensemble methodology; machine learning; preprocessing techniques; random undersampling techniques; two-class problems; Accuracy; Algorithm design and analysis; Bagging; Learning systems; Noise; Proposals; Training; boosting; class distribution; classification; ensembles; multiple classifier systems
Journal_Title :
Systems, Man, and Cybernetics, Part C: Applications and Reviews, IEEE Transactions on
DOI :
10.1109/TSMCC.2011.2161285