  • DocumentCode
    2043477
  • Title
    An Efficient Duplicate Detection System for XML Documents
  • Author
    Lwin, Thandar ; Nyunt, Thi Thi Soe
  • Author_Institution
    Univ. of Comput. Studies, Yangon, Myanmar
  • Volume
    2
  • fYear
    2010
  • fDate
    19-21 March 2010
  • Firstpage
    178
  • Lastpage
    182
  • Abstract
    Duplicate detection, an important subtask of data cleaning, is the task of identifying multiple representations of the same real-world object and is necessary to improve data quality. Numerous approaches exist for both relational and XML data. As XML becomes increasingly popular for data exchange and data publishing on the Web, algorithms to detect duplicates in XML documents are required. Previous domain-independent solutions to this problem relied on standard textual similarity functions (e.g., edit distance, cosine metric) between objects. However, such approaches result in large numbers of false positives if we want to identify domain-specific abbreviations and conventions. In this paper, we present a duplicate detection process consisting of three modules (selector, preprocessor, and duplicate identifier), which takes XML documents and a candidate definition as input and produces duplicate objects as output. The aim of this research is to develop an efficient algorithm for detecting duplicates in complex XML documents and to reduce the number of false positives by using the MD5 algorithm. We illustrate the efficiency of this approach on several real-world datasets.
  • Keywords
    Internet; XML; document handling; electronic data interchange; Web; XML Documents; XML data; data cleaning; data exchange; data publishing; data quality; duplicate detection system; duplicate identifier; preprocessor; selector; Application software; Cleaning; Computer applications; Couplings; Data engineering; Data preprocessing; Databases; Object detection; Publishing; Duplicate Detection; MD5 Algorithm
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Computer Engineering and Applications (ICCEA), 2010 Second International Conference on
  • Conference_Location
    Bali Island
  • Print_ISBN
    978-1-4244-6079-3
  • Electronic_ISBN
    978-1-4244-6080-9
  • Type
    conf
  • DOI
    10.1109/ICCEA.2010.189
  • Filename
    5445601