DocumentCode :
441630
Title :
A Novel Content and Style Based Measurement of Web Pages Distance
Author :
Zhang, Q.P. ; Liang, M. ; Lai, L.L.
Author_Institution :
Dept. of Computer Science and Engineering, Fudan University, Shanghai 200433, China; E-MAIL: qpzhang@fudan.edu.cn
Volume :
1
fYear :
2005
fDate :
18-21 Aug. 2005
Firstpage :
429
Lastpage :
435
Abstract :
Many web-based systems now use machine learning techniques to design more intelligent mechanisms for organizing, indexing, and retrieving web content, and a growing number of research efforts and applications need to calculate the distance between web pages in a rational way. Commonly proposed methods are suited to extracting the differences between the HTML documents of web pages, but their results cannot be used to tell the actual distance in terms of the content of the web pages and their appearance as displayed in web browsers. On this basis, content distance, style distance, and hybrid distance are proposed in this paper to make the measurement results more practical. The efficiency of the approach is demonstrated through several classical experiments.
Keywords :
Web mining; Web page; cluster; distance function; Computer science; Content based retrieval; Distance measurement; HTML; Internet; Machine learning; Markup languages; Multimedia databases; Web pages
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Machine Learning and Cybernetics, 2005. Proceedings of 2005 International Conference on
Conference_Location :
Guangzhou, China
Print_ISBN :
0-7803-9091-1
Type :
conf
DOI :
10.1109/ICMLC.2005.1526985
Filename :
1526985