DocumentCode :
2005876
Title :
Learning-Based Fusion for Data Deduplication
Author :
Dinerstein, Jared ; Dinerstein, Sabra ; Egbert, Parris K. ; Clyde, Stephen W.
Author_Institution :
Utah State Univ., Logan, UT, USA
fYear :
2008
fDate :
11-13 Dec. 2008
Firstpage :
66
Lastpage :
71
Abstract :
Rule-based deduplication utilizes expert domain knowledge to identify and remove duplicate data records. Achieving high accuracy in a rule-based system requires the creation of rules containing a good combination of discriminatory clues. Unfortunately, accurate rule-based deduplication often requires significant manual tuning of both the rules and the corresponding thresholds. This need for manual tuning reduces the efficacy of rule-based deduplication and its applicability to real-world data sets. No adequate solution exists for this problem. We propose a novel technique for rule-based deduplication. We apply individual deduplication rules, and combine the resultant match scores via learning-based information fusion. We show empirically that our fused deduplication technique achieves higher average accuracy than traditional rule-based deduplication. Further, our technique alleviates the need for manual tuning of the deduplication rules and corresponding thresholds.
Keywords :
database management systems; knowledge based systems; learning (artificial intelligence); sensor fusion; expert domain knowledge; learning-based information fusion; rule-based data deduplication; Atomic measurements; Computer errors; Data models; Databases; Machine intelligence; Machine learning; Manuals; Support vector machines; XML; information fusion; supervised learning
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Machine Learning and Applications, 2008. ICMLA '08. Seventh International Conference on
Conference_Location :
San Diego, CA
Print_ISBN :
978-0-7695-3495-4
Type :
conf
DOI :
10.1109/ICMLA.2008.83
Filename :
4724957