• DocumentCode
    2871502
  • Title

    Measuring Similarities between XML Documents Based on Content and Structure

  • Author

    Xia, Xiaoling ; Guo, Yongming ; Le, JiaJin

  • Author_Institution
    Sch. of Comput. Sci. & Technol., Donghua Univ., Shanghai, China
  • Volume
    1
  • fYear
    2009
  • fDate
    18-19 July 2009
  • Firstpage
    459
  • Lastpage
    462
  • Abstract
    Extensible Markup Language (XML) has become a de facto standard for information representation and data exchange over the Web. As a result, large numbers of XML documents have emerged in many application areas, such as digital libraries, patent retrieval, and intranet search. Effectively measuring similarities between XML documents plays an important role in these areas. However, most current research focuses on measuring the structural similarity between XML documents and gives little or no consideration to the content of the documents and the links between them. This paper develops a novel similarity measure model based on the Extended Vector Space Model. The model measures similarities between XML documents by combining content, structure, and links. To evaluate the model, we use the k-means algorithm to cluster XML documents. Experiments show that the model achieves better clustering quality than the classical vector space model.
  • Keywords
    XML; document handling; Extended Marked-up Language; Intranet search; World Wide Web; XML document similarity; data exchange; digital library; eXtensible Markup Language; extended vector space model; information representation; k-means algorithm; patent retrieval; similarity measure model; Area measurement; Clustering algorithms; Couplings; Current measurement; Extraterrestrial measurements; Information processing; Information representation; Measurement standards; Software libraries; XML; Extended Vector Space Model; XML; document clustering; similarity measure;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Information Processing, 2009. APCIP 2009. Asia-Pacific Conference on
  • Conference_Location
    Shenzhen
  • Print_ISBN
    978-0-7695-3699-6
  • Type

    conf

  • DOI
    10.1109/APCIP.2009.119
  • Filename
    5197093