• DocumentCode
    3600015
  • Title

    Near-Optimal Approximate Duplicate-Detection in Data Streams Over Sliding Windows for the Uniform Query Frequency or Membership Likelihood

  • Author

    Xiujun Wang ; Xiao Zheng ; Zhe Dang ; Xuangou Wu ; Baohua Zhao

  • Author_Institution
    Anhui Univ. of Technol., Maanshan, China
  • fYear
    2014
  • Firstpage
    122
  • Lastpage
    127
  • Abstract
    Approximate duplicate-detection (or membership query) in data streams answers the question of whether an element from a large universe U (a query element) is present in a small subsequence of a data stream. It is an important query with many Internet applications, such as web crawling and social networks. Existing approximate duplicate-detection methods in the sliding window model are not memory-efficient, because they do not incorporate information on the query frequencies and membership likelihoods of the elements of U into their data structure design, even though this information can be obtained with well-developed techniques. In this paper, assuming that either the query frequency or the membership likelihood is uniform for all elements in U, we adopt a block-wise updating strategy to design a memory-efficient data structure, called the cell Bloom filter (CEBF), and an approximate duplicate-detection algorithm based on CEBF. If the average false positive rate is ε and the sliding window size is n, the number of bits used by our method is 2 log2(e) · n · (log2(1/ε) + 1), which is much less than that of other existing algorithms. Experimental results on synthetic data verify the effectiveness of our method.
  • Keywords
    Internet; data structures; query processing; question answering (information retrieval); CEBF; block-wise updating strategy; cell Bloom filter; data streams; membership likelihood; memory-efficient data structure design; near-optimal approximate duplicate-detection method; sliding window model; uniform query frequency; Algorithm design and analysis; Approximation algorithms; Data models; Electronic mail; Xenon
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Advanced Cloud and Big Data (CBD), 2014 Second International Conference on
  • Print_ISBN
    978-1-4799-8086-4
  • Type

    conf

  • DOI
    10.1109/CBD.2014.54
  • Filename
    7176081