Title :
An Evaluation of Progressive Sampling for Imbalanced Data Sets
Author :
Ng, Willie ; Dash, Manoranjan
Author_Institution :
Centre for Advanced Information Systems, Nanyang Technological University, Singapore
Abstract :
One of the emerging challenges for the data mining research community is to allow learning algorithms to mine huge databases. Sampling has often been suggested as an effective way to circumvent memory limitations as well as to improve processing speed. In this paper, we study the learning-curve sampling method, an approach for applying machine learning algorithms to massive data sets. We show that a naive application of progressive sampling to data sets with highly imbalanced class distributions is often not very effective for training a learning algorithm. We then present a refinement of progressive sampling that works well in practice and converges to the desired sample size quickly and accurately. Empirical results on a number of large data sets show that our approach is able to enhance its performance.
Keywords :
data mining; learning (artificial intelligence); sampling methods; very large databases; learning-curve sampling; machine learning algorithms; progressive sampling; unbalanced data sets; Algorithm design and analysis; Convergence; Costs; Data engineering; Databases; Information systems; Machine learning
Conference_Title :
Data Mining Workshops, 2006. ICDM Workshops 2006. Sixth IEEE International Conference on
Conference_Location :
Hong Kong
Print_ISBN :
0-7695-2702-7
DOI :
10.1109/ICDMW.2006.28