DocumentCode :
1961318
Title :
A generalization of blocking and windowing algorithms for duplicate detection
Author :
Draisbach, Uwe ; Naumann, Felix
Author_Institution :
Hasso Plattner Inst., Potsdam, Germany
fYear :
2011
fDate :
6-6 Sept. 2011
Firstpage :
18
Lastpage :
24
Abstract :
Duplicate detection is the process of finding multiple records in a dataset that represent the same real-world entity. Due to the enormous costs of an exhaustive comparison, typical algorithms select only promising record pairs for comparison. Two competing approaches are blocking and windowing. Blocking methods partition records into disjoint subsets, while windowing methods, in particular the Sorted Neighborhood Method, slide a window over the sorted records and compare records only within the window. We present a new algorithm called Sorted Blocks in several variants, which generalizes both approaches. To evaluate Sorted Blocks, we have conducted extensive experiments with different datasets. These show that our new algorithm needs fewer comparisons to find the same number of duplicates.
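The two candidate-selection strategies the abstract contrasts can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the key functions, and the toy records are assumptions made for the example, and the Sorted Blocks algorithm itself is not reproduced because the abstract does not specify it.

```python
from collections import defaultdict
from itertools import combinations


def blocking_pairs(records, key):
    """Candidate pairs from blocking: records sharing a key form a disjoint block."""
    blocks = defaultdict(list)
    for idx, rec in enumerate(records):
        blocks[key(rec)].append(idx)
    pairs = set()
    for members in blocks.values():
        # Compare every record only with the other records of its own block.
        pairs.update(combinations(members, 2))
    return pairs


def sorted_neighborhood_pairs(records, key, window):
    """Candidate pairs from windowing: sort by key, then slide a fixed-size window."""
    order = sorted(range(len(records)), key=lambda i: key(records[i]))
    pairs = set()
    for pos, i in enumerate(order):
        # Each record is paired with the next (window - 1) records in sort order.
        for j in order[pos + 1 : pos + window]:
            pairs.add((min(i, j), max(i, j)))
    return pairs


# Hypothetical toy data: record indices 0..4.
records = ["smith", "smyth", "jones", "johns", "smith"]
by_first_letter = blocking_pairs(records, key=lambda r: r[0])
by_window = sorted_neighborhood_pairs(records, key=lambda r: r, window=2)
```

Both functions return index pairs rather than match decisions; a separate similarity measure would still have to classify each candidate pair. An exhaustive comparison of the five toy records would require ten pairs, whereas each strategy here proposes fewer.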
Keywords :
database management systems; sorting; blocking algorithms; duplicate detection; sorted blocks; sorted neighborhood method; windowing algorithms
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Data and Knowledge Engineering (ICDKE), 2011 International Conference on
Conference_Location :
Milan
Print_ISBN :
978-1-4577-0865-7
Type :
conf
DOI :
10.1109/ICDKE.2011.6053920
Filename :
6053920