• DocumentCode
    2403436
  • Title
    Detecting changes in XML documents
  • Author
    Cobéna, Grégory ; Abiteboul, Serge ; Marian, Amélie
  • Author_Institution
    INRIA, Rocquencourt, France
  • fYear
    2002
  • fDate
    2002
  • Firstpage
    41
  • Lastpage
    52
  • Abstract
    We present a diff algorithm for XML data. This work is motivated by the need to support change control in the Xyleme project, which is investigating dynamic warehouses capable of storing massive volumes of XML data. Because of this context, our algorithm has to be very efficient in terms of speed and memory space, even at the cost of some loss of quality. Besides insertions, deletions and updates (standard in diffs), it also considers a move operation on subtrees, which is essential in the context of XML. Intuitively, our diff algorithm uses signatures to match (large) subtrees that were left unchanged between the old and new versions. Such exact matchings are then possibly propagated to ancestors and descendants to obtain more matchings. It also uses XML-specific information such as ID attributes. We provide a performance analysis of the algorithm. We show that it runs on average in linear time, vs. quadratic time for previous algorithms. We present experiments on synthetic data that confirm the analysis. Since this problem is NP-hard, the linear time is obtained by trading some quality. We present experiments (again on synthetic data) that show that the output of our algorithm is reasonably close to the optimal in terms of quality. Finally, we present experiments on a small sample of XML pages found on the Web.
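The signature idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `signatures` and `match_unchanged` are hypothetical, SHA-1 is an arbitrary choice of hash, and the sketch covers only the exact matching of unchanged subtrees (heaviest first). It omits the propagation to ancestors and descendants, the use of ID attributes, and the generation of insert, delete, update and move operations.

```python
import hashlib
import xml.etree.ElementTree as ET


def signatures(root):
    """Return {element: (signature, weight)} computed bottom-up.

    The signature hashes the tag, the stripped text and the child
    signatures; the weight is the number of elements in the subtree.
    Attributes and tail text are ignored in this sketch.
    """
    info = {}

    def visit(node):
        h = hashlib.sha1()
        h.update(node.tag.encode() + b"\0")
        h.update((node.text or "").strip().encode() + b"\0")
        weight = 1
        for child in node:
            child_sig, child_weight = visit(child)
            h.update(child_sig)
            weight += child_weight
        info[node] = (h.digest(), weight)
        return info[node]

    visit(root)
    return info


def match_unchanged(old_root, new_root):
    """Pair identical subtrees of the two versions, largest first."""
    old_info = signatures(old_root)
    new_info = signatures(new_root)

    # Index old subtrees by signature.
    by_sig = {}
    for node, (sig, _) in old_info.items():
        by_sig.setdefault(sig, []).append(node)

    matches = []
    used_old, used_new = set(), set()
    heaviest_first = sorted(new_info.items(), key=lambda kv: -kv[1][1])
    for new_node, (sig, _) in heaviest_first:
        if new_node in used_new:
            continue  # already covered by a larger matched subtree
        for old_node in by_sig.get(sig, []):
            if old_node not in used_old:
                matches.append((old_node, new_node))
                used_old.update(old_node.iter())
                used_new.update(new_node.iter())
                break
    return matches


old = ET.fromstring("<a><b><c>x</c><d>y</d></b><e>z</e></a>")
new = ET.fromstring("<a><e>z</e><f><b><c>x</c><d>y</d></b></f></a>")
pairs = match_unchanged(old, new)
matched_tags = sorted(new_node.tag for _, new_node in pairs)
```

In this example the `b` subtree has been moved under a new `f` element. Because its signature is unchanged, it is matched as a whole, which is the kind of information a diff would need in order to report a move rather than a delete followed by an insert.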
  • Keywords
    data warehouses; hypermedia markup languages; information resources; software performance evaluation; tree data structures; ID attributes; NP hard; World Wide Web; XML documents; Xyleme project; deletions; diff algorithm; experiments; insertions; linear time; move operation; performance analysis; quadratic time; subtrees; updates; Change detection algorithms; Costs; Crawlers; Data warehouses; Database languages; HTML; Performance analysis; Subscriptions; Web and internet services; XML
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Data Engineering, 2002. Proceedings. 18th International Conference on
  • Conference_Location
    San Jose, CA
  • ISSN
    1063-6382
  • Print_ISBN
    0-7695-1531-2
  • Type
    conf
  • DOI
    10.1109/ICDE.2002.994696
  • Filename
    994696