Title :
A generalization of blocking and windowing algorithms for duplicate detection
Author :
Draisbach, Uwe ; Naumann, Felix
Author_Institution :
Hasso Plattner Inst., Potsdam, Germany
Abstract :
Duplicate detection is the process of finding multiple records in a dataset that represent the same real-world entity. Due to the enormous costs of an exhaustive comparison, typical algorithms select only promising record pairs for comparison. Two competing approaches are blocking and windowing. Blocking methods partition records into disjoint subsets, while windowing methods, in particular the Sorted Neighborhood Method, slide a window over the sorted records and compare records only within the window. We present a new algorithm called Sorted Blocks in several variants, which generalizes both approaches. To evaluate Sorted Blocks, we have conducted extensive experiments with different datasets. These show that our new algorithm needs fewer comparisons to find the same number of duplicates.
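The two candidate-selection strategies named in the abstract can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function names, the sample names, the first-letter blocking key, and the window size of 3 are all assumptions made for the example. Sorted Blocks itself is not sketched, because the abstract does not describe how it works.

```python
from itertools import combinations


def blocking_pairs(records, key):
    """Partition records into disjoint blocks by key; pair records only within a block."""
    blocks = {}
    for rec in records:
        blocks.setdefault(key(rec), []).append(rec)
    pairs = set()
    for block in blocks.values():
        for a, b in combinations(block, 2):
            pairs.add(tuple(sorted((a, b))))
    return pairs


def sorted_neighborhood_pairs(records, key, window):
    """Sort records by key, slide a fixed-size window, pair records inside the window."""
    ordered = sorted(records, key=key)
    pairs = set()
    for i, rec in enumerate(ordered):
        # Each record is paired with the (window - 1) records that follow it.
        for other in ordered[i + 1 : i + window]:
            pairs.add(tuple(sorted((rec, other))))
    return pairs


# Hypothetical sample data and keys, chosen only for illustration.
names = ["meier", "meyer", "maier", "smith", "smyth", "naumann"]
blocked = blocking_pairs(names, key=lambda name: name[0])
windowed = sorted_neighborhood_pairs(names, key=lambda name: name, window=3)

print(len(blocked), "candidate pairs with blocking")
print(len(windowed), "candidate pairs with the sorted neighborhood window")
```

Both functions return only candidate pairs; deciding whether a pair is actually a duplicate would need a separate similarity measure, which is outside this sketch.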
Keywords :
database management systems; sorting; blocking algorithms; duplicate detection; sorted blocks; sorted neighborhood method; windowing algorithms
Conference_Title :
Data and Knowledge Engineering (ICDKE), 2011 International Conference on
Conference_Location :
Milan
Print_ISBN :
978-1-4577-0865-7
DOI :
10.1109/ICDKE.2011.6053920