DocumentCode
2043477
Title
An Efficient Duplicate Detection System for XML Documents
Author
Lwin, Thandar ; Nyunt, Thi Thi Soe
Author_Institution
Univ. of Comput. Studies, Yangon, Myanmar
Volume
2
fYear
2010
fDate
19-21 March 2010
Firstpage
178
Lastpage
182
Abstract
Duplicate detection, an important subtask of data cleaning, is the task of identifying multiple representations of the same real-world object and is necessary to improve data quality. Numerous approaches exist for both relational and XML data. As XML becomes increasingly popular for data exchange and data publishing on the Web, algorithms to detect duplicates in XML documents are required. Previous domain-independent solutions to this problem relied on standard textual similarity functions (e.g., edit distance, cosine metric) between objects. However, such approaches result in large numbers of false positives if we want to identify domain-specific abbreviations and conventions. In this paper, we present a duplicate detection process consisting of three modules (a selector, a preprocessor, and a duplicate identifier) that takes XML documents and a candidate definition as input and produces duplicate objects as output. The aim of this research is to develop an efficient algorithm for detecting duplicates in complex XML documents and to reduce the number of false positives by using the MD5 algorithm. We illustrate the efficiency of this approach on several real-world datasets.
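The three-module process named in the abstract (selector, preprocessor, duplicate identifier) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how the candidate definition is expressed, how objects are normalized, or how MD5 is applied. The sketch assumes the candidate definition is simply an element tag name, that preprocessing lowercases text, collapses whitespace, and sorts child fields, and that two objects count as duplicates when the MD5 digests of their normalized forms match. The function names and the sample data are hypothetical.

```python
import hashlib
import xml.etree.ElementTree as ET
from collections import defaultdict


def select_candidates(root, candidate_tag):
    """Selector (assumed form): collect every element matching the candidate tag."""
    return list(root.iter(candidate_tag))


def preprocess(element):
    """Preprocessor (assumed form): build an order-independent, normalized string."""
    parts = []
    for node in element.iter():
        # Lowercase the text and collapse runs of whitespace.
        text = " ".join((node.text or "").split()).lower()
        if text:
            parts.append(f"{node.tag}={text}")
    # Sorting makes the result independent of child element order.
    return "|".join(sorted(parts))


def find_duplicates(xml_text, candidate_tag):
    """Duplicate identifier (assumed form): group candidates by MD5 digest."""
    root = ET.fromstring(xml_text)
    groups = defaultdict(list)
    for index, element in enumerate(select_candidates(root, candidate_tag)):
        digest = hashlib.md5(preprocess(element).encode("utf-8")).hexdigest()
        groups[digest].append(index)
    # Keep only digests shared by more than one candidate.
    return [indices for indices in groups.values() if len(indices) > 1]


# Hypothetical sample data, not taken from the paper's datasets.
SAMPLE = """
<library>
  <book><title>XML  Basics</title><year>2009</year></book>
  <book><year>2009</year><title>xml basics</title></book>
  <book><title>Data Cleaning</title><year>2010</year></book>
</library>
"""

if __name__ == "__main__":
    # Positions (in document order) of candidates that share a digest.
    print(find_duplicates(SAMPLE, "book"))
```

In this sketch, hash matching only finds objects that become identical after normalization, so the choice of normalization rules determines which variants are treated as the same object.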
Keywords
Internet; XML; document handling; electronic data interchange; Web; XML Documents; XML data; data cleaning; data exchange; data publishing; data quality; duplicate detection system; duplicate identifier; preprocessor; selector; Application software; Cleaning; Computer applications; Couplings; Data engineering; Data preprocessing; Databases; Object detection; Publishing; XML; Data Cleaning; Duplicate Detection; MD5 Algorithm; XML;
fLanguage
English
Publisher
ieee
Conference_Titel
Computer Engineering and Applications (ICCEA), 2010 Second International Conference on
Conference_Location
Bali Island
Print_ISBN
978-1-4244-6079-3
Electronic_ISBN
978-1-4244-6080-9
Type
conf
DOI
10.1109/ICCEA.2010.189
Filename
5445601