Title :
Measuring Similarities between XML Documents Based on Content and Structure
Author :
Xia, Xiaoling ; Guo, Yongming ; Le, JiaJin
Author_Institution :
Sch. of Comput. Sci. & Technol., Donghua Univ., Shanghai, China
Abstract :
Extensible Markup Language (XML) has become a de facto standard for information representation and data exchange over the Web. As a result, large numbers of XML documents are emerging in many application areas, such as digital libraries, patent retrieval and intranet search. Effectively measuring similarities between XML documents plays an important role in these application areas. However, most current research focuses on measuring the structural similarities between XML documents and gives little or no consideration to the content of the documents or the links between them. This paper develops a novel similarity measure model based on the Extended Vector Space Model. The model measures similarities between XML documents by combining content, structure and links. To evaluate the similarity measure model, we adopt the k-means algorithm to cluster XML documents. Experiments show that this model achieves better clustering quality than the classical vector space model.
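To illustrate the general idea of combining content and structure in one vector space, the following is a minimal sketch and not the authors' model: the abstract does not give the model's formulas, so the feature definition here (a term qualified by the element path it occurs in), the helper names path_term_vector and cosine, and the sample documents are all assumptions made for illustration. Link information and the k-means step are left out.

```python
import math
import re
import xml.etree.ElementTree as ET
from collections import Counter


def path_term_vector(xml_text):
    """Count (element path, term) pairs, so each term is tied to where it occurs.

    Hypothetical feature scheme chosen for illustration; the paper's own
    weighting is not described in the abstract.
    """
    root = ET.fromstring(xml_text)
    vector = Counter()

    def walk(node, path):
        current = path + "/" + node.tag
        # Only text directly inside the element is used; tail text is ignored.
        for term in re.findall(r"\w+", (node.text or "").lower()):
            vector[(current, term)] += 1
        for child in node:
            walk(child, current)

    walk(root, "")
    return vector


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * v[key] for key, count in u.items() if key in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Made-up sample documents (not from the paper's experiments).
doc_a = "<paper><title>xml similarity</title><body>vector space model</body></paper>"
doc_b = "<paper><title>xml clustering</title><body>vector space model</body></paper>"
doc_c = "<paper><title>vector space model</title><body>xml similarity</body></paper>"

sim_ab = cosine(path_term_vector(doc_a), path_term_vector(doc_b))
sim_ac = cosine(path_term_vector(doc_a), path_term_vector(doc_c))
```

In this sketch doc_a and doc_c contain exactly the same words with the title and body contents swapped. A plain bag-of-words vector would treat them as identical, whereas the path-qualified vectors share no feature, which is the kind of distinction a combined content-and-structure measure is meant to capture.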
Keywords :
XML; document handling; Extended Marked-up Language; Intranet search; World Wide Web; XML document similarity; data exchange; digital library; eXtensible Markup Language; extended vector space model; information representation; k-means algorithm; patent retrieval; similarity measure model; Area measurement; Clustering algorithms; Couplings; Current measurement; Extraterrestrial measurements; Information processing; Measurement standards; Software libraries; document clustering; similarity measure
Conference_Title :
Information Processing, 2009. APCIP 2009. Asia-Pacific Conference on
Conference_Location :
Shenzhen
Print_ISBN :
978-0-7695-3699-6
DOI :
10.1109/APCIP.2009.119