Efficient progressive sampling for association rules

Author

Parthasarathy, Srinivasan

Author_Institution

Dept. of Comput. & Inf. Sci., Ohio State Univ., Columbus, OH, USA

fYear

2002

fDate

2002

Firstpage

354

Lastpage

361

Abstract

In data mining, sampling has often been suggested as an effective tool to reduce the size of the dataset operated on, at some cost to accuracy. However, this loss of accuracy is often difficult to measure and characterize, since the exact nature of the learning curve (accuracy vs. sample size) is parameter and data dependent; i.e., we do not know a priori what sample size is needed to achieve a desired accuracy on a particular dataset for a particular set of parameters. In this article we propose the use of progressive sampling to determine the required sample size for association rule mining. We first show that a naive application of progressive sampling is not very efficient for association rule mining. We then present a refinement based on equivalence classes that seems to work extremely well in practice and converges to the desired sample size very quickly and very accurately. An additional novelty of our approach is the definition of a support-sensitive, interactive measure of accuracy across progressive samples.

Keywords

data mining; equivalence classes; fractals; association rules; dataset; progressive sampling; rule mining; Costs; Databases; Delay; Information science; Loss measurement; Pressing; Sampling methods; Size measurement

fLanguage

English

Publisher

IEEE

Conference_Titel

Data Mining, 2002. ICDM 2002. Proceedings. 2002 IEEE International Conference on

Print_ISBN

0-7695-1754-4

Type

conf

DOI

10.1109/ICDM.2002.1183923

Filename

1183923