• DocumentCode
    1627989
  • Title
    Detecting Duplicates in Complex XML Data
  • Author
    Weis, Melanie ; Naumann, Felix
  • Author_Institution
    Humboldt-Universität zu Berlin
  • fYear
    2006
  • Firstpage
    109
  • Lastpage
    109
  • Abstract
    Recent work in both the relational and the XML world has shown that the efficacy and efficiency of duplicate detection are enhanced by considering relationships between entities. However, most approaches for XML data rely on 1:n parent/child relationships and do not apply to XML data that represents m:n relationships. We present a novel comparison strategy that performs duplicate detection effectively for all kinds of parent/child relationships, given dependencies between different XML elements. Due to cyclic dependencies, a pairwise classification may be performed more than once, which compromises efficiency. We propose a comparison order that reduces the number of such reclassifications and apply it to two algorithms. The first algorithm performs reclassifications, and the order improves its efficiency by reducing how many are needed. The second algorithm never performs a comparison more than once; there, the order is used so that few reclassifications, and hence few potential duplicates, are missed.
  • Keywords
    Clustering algorithms; Customer relationship management; Data models; Data warehouses; Detection algorithms; Information management; Motion pictures; Object detection; Partitioning algorithms; XML;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Data Engineering, 2006. ICDE '06. Proceedings of the 22nd International Conference on
  • Print_ISBN
    0-7695-2570-9
  • Type
    conf
  • DOI
    10.1109/ICDE.2006.49
  • Filename
    1617477