Title :
Robust identification of fuzzy duplicates
Author :
Chaudhuri, Surajit ; Ganti, Venkatesh ; Motwani, Rajeev
Author_Institution :
Microsoft Res., USA
Abstract :
Detecting and eliminating fuzzy duplicates is a critical data cleaning task that is required by many applications. Fuzzy duplicates are multiple seemingly distinct tuples that represent the same real-world entity. We propose two novel criteria that enable characterization of fuzzy duplicates more accurately than is possible with existing techniques. Using these criteria, we propose a novel framework for the fuzzy duplicate elimination problem. We show that solutions within the new framework result in better accuracy than earlier approaches. We present an efficient algorithm for solving instantiations within the framework. We evaluate the algorithm on real datasets to demonstrate its accuracy and scalability.
Keywords :
data analysis; data mining; relational databases; data cleaning task; fuzzy duplicate detection; fuzzy duplicate elimination problem; fuzzy duplicate robust identification; Catalogs; Cleaning; Clustering algorithms; Costs; Couplings; Data mining; Partitioning algorithms; Robustness; Scalability; Training data
Conference_Title :
Data Engineering, 2005. ICDE 2005. Proceedings. 21st International Conference on
Print_ISBN :
0-7695-2285-8
DOI :
10.1109/ICDE.2005.125