• DocumentCode
    1199687
  • Title

    Quality-Aware Sampling and Its Applications in Incremental Data Mining

  • Author

    Chuang, Kun-Ta ; Lin, Keng-Pei ; Chen, Ming-Syan

  • Author_Institution
    Graduate Inst. of Commun. Eng., Nat. Taiwan Univ., Taipei
  • Volume
    19
  • Issue
    4
  • fYear
    2007
  • fDate
    4/1/2007 12:00:00 AM
  • Firstpage
    468
  • Lastpage
    484
  • Abstract
    We explore in this paper a novel sampling algorithm, referred to as algorithm PAS (standing for proportion approximation sampling), to generate a high-quality online sample with the desired sample rate. The sampling quality refers to the consistency between the population proportion and the sample proportion of each categorical value in the database. Note that the state-of-the-art sampling algorithm to preserve the sampling quality has to examine the population proportion of each categorical value in a pilot sample a priori and is thus not applicable to incremental mining applications. To remedy this, algorithm PAS adaptively determines the inclusion probability of each incoming tuple in such a way that the sampling quality can be sequentially preserved while also keeping the sample rate close to the user-specified one. Importantly, PAS not only guarantees the proportion consistency of each categorical value but also preserves the proportion consistency of multivariate statistics well, which benefits various data mining applications. For better execution efficiency, we further devise an algorithm, called algorithm EQAS (standing for efficient quality-aware sampling), which integrates PAS and random sampling to provide the flexibility of striking a compromise between the sampling quality and the sampling efficiency. As validated in experimental results on real and synthetic data, algorithm PAS can stably provide high-quality samples with corresponding computational overhead, whereas algorithm EQAS can flexibly generate samples with the desired balance between sampling quality and sampling efficiency.
  • Keywords
    data mining; random processes; sampling methods; efficient quality-aware sampling; inclusion probability; incremental data mining; population proportion consistency; proportion approximation sampling; random sampling; state-of-the-art sampling algorithm; Approximation algorithms; Computational efficiency; Data mining; Databases; Degradation; Probability; Sampling methods; Size measurement; Statistics; Sequential sampling; incremental data mining
  • fLanguage
    English
  • Journal_Title
    Knowledge and Data Engineering, IEEE Transactions on
  • Publisher
    IEEE
  • ISSN
    1041-4347
  • Type

    jour

  • DOI
    10.1109/TKDE.2007.1005
  • Filename
    4118705
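
The abstract defines sampling quality as the consistency between the population proportion and the sample proportion of each categorical value. A minimal sketch of how that gap could be measured is given below. It is not algorithm PAS or EQAS, whose details the abstract does not give: the sampler shown is plain random (Bernoulli) sampling, and the names `proportions`, `proportion_gap`, and `bernoulli_sample`, the toy data, and the choice of the largest absolute gap as the summary measure are all assumptions made for illustration.

```python
import random
from collections import Counter


def proportions(values):
    """Return the fraction of tuples carrying each categorical value."""
    counts = Counter(values)
    total = len(values)
    return {value: count / total for value, count in counts.items()}


def proportion_gap(population, sample):
    """Largest absolute gap between population and sample proportions.

    Assumed summary measure, not the paper's. Only population values are
    scanned, which is sufficient when the sample is drawn from the population.
    """
    pop = proportions(population)
    smp = proportions(sample) if sample else {}
    return max(abs(p - smp.get(value, 0.0)) for value, p in pop.items())


def bernoulli_sample(stream, rate, rng):
    """Plain random sampling: keep each incoming tuple with probability `rate`."""
    return [item for item in stream if rng.random() < rate]


# Hypothetical toy population with one categorical attribute.
rng = random.Random(0)
population = ["a"] * 700 + ["b"] * 250 + ["c"] * 50
rng.shuffle(population)

sample = bernoulli_sample(population, rate=0.1, rng=rng)
gap = proportion_gap(population, sample)
print(f"sample size: {len(sample)}, largest proportion gap: {gap:.4f}")
```

A gap of 0 means the sample reproduces every population proportion exactly. Random sampling offers no such guarantee for any single run, which is the shortcoming a quality-aware sampler is meant to address.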