DocumentCode :
268117
Title :
OligoIS: Scalable Instance Selection for Class-Imbalanced Data Sets
Author :
García-Pedrajas, Nicolás ; Perez-Rodríguez, Javier ; de Haro-García, Aida
Author_Institution :
Dept. of Comput. & Numerical Anal., Univ. of Cordoba, Cordoba, Spain
Volume :
43
Issue :
1
fYear :
2013
fDate :
Feb. 2013
Firstpage :
332
Lastpage :
346
Abstract :
In current research, an enormous amount of information is constantly being produced, which poses a challenge for data mining algorithms. Many of the problems in extremely active research areas, such as bioinformatics, security and intrusion detection, or text mining, share the following two features: large data sets and class-imbalanced distribution of samples. Although many methods have been proposed for dealing with class-imbalanced data sets, most of these methods are not scalable to the very large data sets common to those research fields. In this paper, we propose a new approach to dealing with the class-imbalance problem that is scalable to data sets with many millions of instances and hundreds of features. This proposal is based on the divide-and-conquer principle combined with application of the selection process to balanced subsets of the whole data set. This divide-and-conquer principle allows the execution of the algorithm in linear time. Furthermore, the proposed method is easy to implement using a parallel environment and can work without loading the whole data set into memory. Using 40 class-imbalanced medium-sized data sets, we will demonstrate our method's ability to improve the results of state-of-the-art instance selection methods for class-imbalanced data sets. Using three very large data sets, we will show the scalability of our proposal to millions of instances and hundreds of features.
Keywords :
data mining; divide and conquer methods; sampling methods; OligoIS; class-imbalance problem; class-imbalanced data sets; class-imbalanced medium-sized data sets; class-imbalanced sample distribution; data mining algorithms; divide-and-conquer principle; large data set sample distribution; scalable instance selection; state-of-the-art instance selection methods; Accuracy; Approximation algorithms; Blades; Evolutionary computation; Proposals; Scalability; Training; Class-imbalance problem; instance selection; instance-based learning; very large problems;
fLanguage :
English
Journal_Title :
Cybernetics, IEEE Transactions on
Publisher :
IEEE
ISSN :
2168-2267
Type :
jour
DOI :
10.1109/TSMCB.2012.2206381
Filename :
6253271