Title :
Efficient progressive sampling for association rules
Author :
Parthasarathy, Srinivasan
Author_Institution :
Dept. of Comput. & Inf. Sci., Ohio State Univ., Columbus, OH, USA
Abstract :
In data mining, sampling has often been suggested as an effective tool to reduce the size of the dataset operated on, at some cost to accuracy. However, this loss of accuracy is often difficult to measure and characterize, since the exact nature of the learning curve (accuracy vs. sample size) is parameter- and data-dependent; i.e., we do not know a priori what sample size is needed to achieve a desired accuracy on a particular dataset for a particular set of parameters. In this article we propose the use of progressive sampling to determine the required sample size for association rule mining. We first show that a naive application of progressive sampling is not very efficient for association rule mining. We then present a refinement based on equivalence classes that seems to work extremely well in practice and converges to the desired sample size very quickly and very accurately. An additional novelty of our approach is the definition of a support-sensitive, interactive measure of accuracy across progressive samples.
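As an illustration only, the naive progressive sampling loop mentioned in the abstract might look like the Python sketch below. The abstract does not describe the equivalence-class refinement or the support-sensitive accuracy measure, so neither is reproduced here. The function names, the geometric growth schedule (`start`, `factor`), the cap on itemset size (`max_size`), and the use of Jaccard similarity between the frequent itemsets of successive samples as a convergence test are all assumptions made for this sketch, not the author's method.

```python
import random
from itertools import combinations


def frequent_itemsets(transactions, min_support, max_size=2):
    """Brute-force count of itemsets up to max_size; return {itemset: support}.

    Hypothetical helper: a real miner (e.g. Apriori) would prune candidates.
    """
    n = len(transactions)
    counts = {}
    for t in transactions:
        items = sorted(set(t))
        for k in range(1, max_size + 1):
            for combo in combinations(items, k):
                counts[combo] = counts.get(combo, 0) + 1
    return {s: c / n for s, c in counts.items() if c / n >= min_support}


def jaccard(a, b):
    """Jaccard similarity of two sets (assumed stand-in accuracy measure)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def progressive_sample_size(transactions, min_support, start=100,
                            factor=2, threshold=0.95, seed=0):
    """Grow the sample geometrically until successive results agree.

    Returns the smaller of the two sample sizes whose frequent itemsets
    have Jaccard similarity >= threshold, or the full dataset size if
    no pair of successive samples agrees.
    """
    rng = random.Random(seed)
    shuffled = list(transactions)
    rng.shuffle(shuffled)

    size = min(start, len(shuffled))
    prev = set(frequent_itemsets(shuffled[:size], min_support))
    while size < len(shuffled):
        next_size = min(size * factor, len(shuffled))
        cur = set(frequent_itemsets(shuffled[:next_size], min_support))
        if jaccard(prev, cur) >= threshold:
            return size  # the learning curve has flattened at this size
        size, prev = next_size, cur
    return size
```

Each iteration re-mines the enlarged sample from scratch, which is one plausible reason a naive scheme of this kind is costly for association rule mining.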
Keywords :
data mining; equivalence classes; fractals; association rules; dataset; progressive sampling; rule mining; Costs; Databases; Delay; Information science; Loss measurement; Pressing; Sampling methods; Size measurement
Conference_Title :
Data Mining, 2002. ICDM 2002. Proceedings. 2002 IEEE International Conference on
Print_ISBN :
0-7695-1754-4
DOI :
10.1109/ICDM.2002.1183923