• DocumentCode
    3123452
  • Title

    Distinct Counting with a Self-Learning Bitmap

  • Author

    Chen, Aiyou ; Cao, Jin

  • Author_Institution
    Bell Labs., Alcatel-Lucent, Paris
  • fYear
    2009
  • fDate
    March 29 2009-April 2 2009
  • Firstpage
    1171
  • Lastpage
    1174
  • Abstract
    Estimating the number of distinct values is a fundamental problem in databases that has attracted extensive research over the past two decades, due to its wide applications (especially on the Internet). Many algorithms based on sampling or sketching have been proposed for obtaining statistical estimates that require only limited computing and memory resources. However, their performance in terms of relative estimation accuracy usually depends on the unknown cardinalities. In this paper, we address the following question: can a distinct counting algorithm have uniformly reliable performance, i.e., constant relative estimation errors for unknown cardinalities in a wide range, say from tens to millions? We propose a self-learning bitmap algorithm (S-bitmap) to answer this question. The S-bitmap is a bitmap obtained via a novel adaptive sampling process, where the bits corresponding to the sampled items are set to 1, and the sampling rates are learned from the number of distinct items already passed and reduced sequentially as more bits are set to 1. A unique property of S-bitmap is that its relative estimation error is truly stabilized, i.e., invariant to unknown cardinalities in a prescribed range. We demonstrate through both theoretical and empirical studies that, for a given memory requirement, S-bitmap is not only uniformly reliable but also more accurate than state-of-the-art algorithms such as the multiresolution bitmap and HyperLogLog algorithms under common practice settings.
  • Keywords
    database theory; set theory; statistical analysis; Hyper LogLog algorithms; Internet; adaptive sampling process; multiresolution bitmap; relative estimation; self-learning bitmap; Data engineering; Databases; Estimation error; Internet; Monitoring; Query processing; Reliability theory; Sampling methods; Statistical distributions; Telecommunication traffic; bitmap; distinct counting; sampling; streaming data; uniform reliability;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Data Engineering, 2009. ICDE '09. IEEE 25th International Conference on
  • Conference_Location
    Shanghai
  • ISSN
    1084-4627
  • Print_ISBN
    978-1-4244-3422-0
  • Electronic_ISBN
    1084-4627
  • Type

    conf

  • DOI
    10.1109/ICDE.2009.193
  • Filename
    4812493
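
The adaptive sampling process described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the class name `SBitmapSketch`, the geometric rate schedule controlled by the hypothetical `decay` parameter, the SHA-256 hashing, and the method-of-moments estimator are all assumptions chosen for illustration, and the record does not give the sampling rates or the estimator used in the paper. The sketch shows only the mechanism the abstract names: a sampled item sets a bit to 1, and the sampling rate falls as more bits are set.

```python
import hashlib


class SBitmapSketch:
    """Illustrative adaptive-sampling bitmap (assumed design, not the paper's)."""

    def __init__(self, m=4096, decay=0.999):
        self.m = m                  # number of bits in the bitmap
        self.bits = bytearray(m)    # one byte per bit, for readability
        self.filled = 0             # number of bits currently set to 1
        # Assumed schedule: rates[k] is the sampling rate used while k bits
        # are set. It is non-increasing in k, so a repeated item can never be
        # accepted later if it was rejected earlier. A compact implementation
        # would compute the rate on the fly instead of storing this table.
        self.rates = [decay ** k for k in range(m)]

    def _hash(self, item):
        # Deterministic hash: the same item always maps to the same bucket
        # and the same uniform value u in [0, 1).
        digest = hashlib.sha256(repr(item).encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:8], "big") % self.m
        u = int.from_bytes(digest[8:16], "big") / 2 ** 64
        return bucket, u

    def add(self, item):
        bucket, u = self._hash(item)
        if self.bits[bucket]:
            return                  # bucket already set: nothing to do
        if u < self.rates[self.filled]:
            self.bits[bucket] = 1   # item is sampled: set its bit
            self.filled += 1        # and move on to the next, lower rate

    def estimate(self):
        # With k bits set, a new distinct item sets a bit with probability
        # q_k = (fraction of empty buckets) * rates[k], so about 1 / q_k
        # distinct items are needed per additional bit. Summing over the
        # bits that were set gives a simple cardinality estimate.
        total = 0.0
        for k in range(self.filled):
            total += 1.0 / ((1.0 - k / self.m) * self.rates[k])
        return total


sketch = SBitmapSketch()
for _ in range(3):                  # each item appears three times
    for i in range(10000):
        sketch.add(f"flow-{i}")
print(round(sketch.estimate()))     # an estimate of the 10000 distinct items
```

Because the hash is deterministic and the rates never increase, replaying items that were already seen leaves the bitmap unchanged, which is what makes the structure count distinct items rather than total items.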