DocumentCode
2871502
Title
Measuring Similarities between XML Documents Based on Content and Structure
Author
Xia, Xiaoling ; Guo, Yongming ; Le, JiaJin
Author_Institution
Sch. of Comput. Sci. & Technol., Donghua Univ., Shanghai, China
Volume
1
fYear
2009
fDate
18-19 July 2009
Firstpage
459
Lastpage
462
Abstract
Extensible Markup Language (XML) has become a de facto standard for information representation and data exchange over the Web. As a result, large numbers of XML documents are emerging in many application areas, such as digital libraries, patent retrieval, and intranet search. Effectively measuring similarities between XML documents plays an important role in these application areas. However, most current research focuses on measuring the structural similarity between XML documents and takes little or no account of the documents' content or the links between documents. This paper develops a novel similarity measure model based on the Extended Vector Space Model. The model can effectively measure similarities between XML documents by combining content, structure, and links. To evaluate the model, we adopt the k-means algorithm to cluster XML documents. Experiments show that the model achieves better clustering quality than the classical vector space model.
Keywords
XML; document handling; Extended Marked-up Language; Intranet search; World Wide Web; XML document similarity; data exchange; digital library; eXtensible Markup Language; extended vector space model; information representation; k-means algorithm; patent retrieval; similarity measure model; Area measurement; Clustering algorithms; Couplings; Current measurement; Extraterrestrial measurements; Information processing; Information representation; Measurement standards; Software libraries; XML; Extended Vector Space Model; XML; document clustering; similarity measure;
fLanguage
English
Publisher
IEEE
Conference_Titel
Information Processing, 2009. APCIP 2009. Asia-Pacific Conference on
Conference_Location
Shenzhen
Print_ISBN
978-0-7695-3699-6
Type
conf
DOI
10.1109/APCIP.2009.119
Filename
5197093