Title :
Near-Optimal Approximate Duplicate-Detection in Data Streams Over Sliding Windows for the Uniform Query Frequency or Membership Likelihood
Author :
Xiujun Wang ; Xiao Zheng ; Zhe Dang ; Xuangou Wu ; Baohua Zhao
Author_Institution :
Anhui Univ. of Technol., Maanshan, China
Abstract :
Approximate duplicate-detection (or membership query) in data streams answers the question of whether an element from a large universe U (a query element) is present in a small subsequence of a data stream. It is an important query with many Internet applications, such as web crawling and social networks. Existing approximate duplicate-detection methods in the sliding window model are not memory-efficient, because their data structure designs do not incorporate information on the query frequencies and membership likelihoods of the elements in U, even though this information can be obtained with well-developed techniques. In this paper, assuming that either the query frequency or the membership likelihood is uniform for all elements in U, we adopt a block-wise updating strategy to design a memory-efficient data structure, called the cell Bloom filter (CEBF), and an approximate duplicate-detection algorithm based on CEBF. If the average false positive rate is ε and the sliding window size is n, then the number of bits used by our method is 2 log2(e) · n · (log2(1/ε) + 1), which is much less than that of other existing algorithms. Experimental results on synthetic data verify the effectiveness of our method.
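As a rough illustration of the memory bound stated in the abstract, the following Python sketch evaluates 2 log2(e) · n · (log2(1/ε) + 1) for a given window size and false positive rate. The function name `cebf_bits` and the example parameter values are assumptions made for illustration; the sketch only evaluates the formula and does not implement CEBF itself.

```python
import math


def cebf_bits(n: int, eps: float) -> float:
    """Evaluate the abstract's bound: 2 * log2(e) * n * (log2(1/eps) + 1).

    n   -- sliding window size (number of elements), must be positive
    eps -- average false positive rate, must satisfy 0 < eps < 1
    Hypothetical helper: it computes the stated bit count only.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0.0 < eps < 1.0:
        raise ValueError("eps must be in (0, 1)")
    return 2 * math.log2(math.e) * n * (math.log2(1 / eps) + 1)


if __name__ == "__main__":
    # Example values chosen for illustration only.
    n, eps = 1_000_000, 0.001
    total = cebf_bits(n, eps)
    print(f"total bits: {total:.0f}")
    print(f"bits per window element: {total / n:.2f}")
```

Under this formula the cost per window element depends only on ε, so halving ε adds a fixed 2 log2(e) (about 2.89) bits per element.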
Keywords :
Internet; data structures; query processing; question answering (information retrieval); CEBF; Internet; block-wise updating strategy; cell Bloom filter; data streams; membership likelihood; memory-efficient data structure design; near-optimal approximate duplicate-detection method; sliding window model; uniform query frequency; Algorithm design and analysis; Approximation algorithms; Data models; Data structures; Electronic mail; Internet; Xenon;
Conference_Title :
Advanced Cloud and Big Data (CBD), 2014 Second International Conference on
Print_ISBN :
978-1-4799-8086-4
DOI :
10.1109/CBD.2014.54