DocumentCode
1961318
Title
A generalization of blocking and windowing algorithms for duplicate detection
Author
Draisbach, Uwe ; Naumann, Felix
Author_Institution
Hasso Plattner Inst., Potsdam, Germany
fYear
2011
fDate
6-6 Sept. 2011
Firstpage
18
Lastpage
24
Abstract
Duplicate detection is the process of finding multiple records in a dataset that represent the same real-world entity. Due to the enormous costs of an exhaustive comparison, typical algorithms select only promising record pairs for comparison. Two competing approaches are blocking and windowing. Blocking methods partition records into disjoint subsets, while windowing methods, in particular the Sorted Neighborhood Method, slide a window over the sorted records and compare records only within the window. We present a new algorithm called Sorted Blocks in several variants, which generalizes both approaches. To evaluate Sorted Blocks, we have conducted extensive experiments with different datasets. These show that our new algorithm needs fewer comparisons to find the same number of duplicates.
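The two baseline strategies named in the abstract can be illustrated with a minimal sketch. It is not the paper's Sorted Blocks algorithm, and it only generates candidate pairs without comparing them. The function names, the string records, and the choice of keys are assumptions made for illustration.

```python
from itertools import combinations


def blocking_pairs(records, key):
    """Blocking: partition records by key, pair records only within a block."""
    blocks = {}
    for rec in records:
        blocks.setdefault(key(rec), []).append(rec)
    pairs = set()
    for block in blocks.values():
        pairs.update(combinations(block, 2))
    return pairs


def sorted_neighborhood_pairs(records, key, window):
    """Windowing: sort by key, pair each record with the next window-1 records."""
    ordered = sorted(records, key=key)
    pairs = set()
    for i, rec in enumerate(ordered):
        for other in ordered[i + 1 : i + window]:
            pairs.add((rec, other))
    return pairs


# Hypothetical toy data: five name strings.
names = ["anna", "anne", "bob", "bobb", "carl"]

# Block on the first letter (an assumed blocking key).
block_candidates = blocking_pairs(names, key=lambda s: s[0])

# Slide a window of size 3 over the alphabetically sorted names.
window_candidates = sorted_neighborhood_pairs(names, key=lambda s: s, window=3)

# An exhaustive comparison would examine every pair.
all_pairs = set(combinations(names, 2))
```

On this toy input, both strategies produce fewer candidate pairs than the exhaustive comparison. Blocking never pairs records from different blocks, while the sliding window can pair neighbors that blocking would have separated.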
Keywords
database management systems; sorting; blocking algorithms; duplicate detection; sorted blocks; sorted neighborhood method; windowing algorithms
fLanguage
English
Publisher
IEEE
Conference_Titel
Data and Knowledge Engineering (ICDKE), 2011 International Conference on
Conference_Location
Milan
Print_ISBN
978-1-4577-0865-7
Type
conf
DOI
10.1109/ICDKE.2011.6053920
Filename
6053920