DocumentCode :
2848578
Title :
Robust identification of fuzzy duplicates
Author :
Chaudhuri, Surajit ; Ganti, Venkatesh ; Motwani, Rajeev
Author_Institution :
Microsoft Res., USA
fYear :
2005
fDate :
5-8 April 2005
Firstpage :
865
Lastpage :
876
Abstract :
Detecting and eliminating fuzzy duplicates is a critical data cleaning task that is required by many applications. Fuzzy duplicates are multiple seemingly distinct tuples, which represent the same real-world entity. We propose two novel criteria that enable characterization of fuzzy duplicates more accurately than is possible with existing techniques. Using these criteria, we propose a novel framework for the fuzzy duplicate elimination problem. We show that solutions within the new framework result in better accuracy than earlier approaches. We present an efficient algorithm for solving instantiations within the framework. We evaluate it on real datasets to demonstrate the accuracy and scalability of our algorithm.
Keywords :
data analysis; data mining; relational databases; data cleaning task; fuzzy duplicate detection; fuzzy duplicate elimination problem; fuzzy duplicate robust identification; Catalogs; Cleaning; Clustering algorithms; Costs; Couplings; Data mining; Partitioning algorithms; Robustness; Scalability; Training data;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Data Engineering, 2005. ICDE 2005. Proceedings. 21st International Conference on
ISSN :
1084-4627
Print_ISBN :
0-7695-2285-8
Type :
conf
DOI :
10.1109/ICDE.2005.125
Filename :
1410199