DocumentCode :
2508429
Title :
A new method on the detection of near-replicas of web pages
Author :
Jia-heng Zheng ; Li-xia Wei ; Hong-ye Tan
Author_Institution :
Dept. of Comput. & Inf. Technol., Shanxi Univ., Taiyuan
fYear :
2008
fDate :
8-11 July 2008
Firstpage :
473
Lastpage :
478
Abstract :
Near-replicas of web pages seriously reduce the efficiency of search engines (SEs). In this paper, we present a new method for detecting near-replicas of web pages. First, the styles of text structure in web pages are analyzed and classified. Then, according to the style of the text, different methods are used to obtain the text structure, which is represented as a matrix. Finally, the similarity is calculated by dynamically extracting features from the matrix. Experiments show that this method not only improves computing efficiency but also maintains high precision and recall.
Keywords :
Internet; classification; text analysis; Web pages near-replicas; text structure analysis; text structure classification; Blogs; Data mining; Feature extraction; HTML; Indexing; Information analysis; Information technology; Navigation; Search engines; Web pages
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Computer and Information Technology, 2008. CIT 2008. 8th IEEE International Conference on
Conference_Location :
Sydney, NSW
Print_ISBN :
978-1-4244-2357-6
Electronic_ISBN :
978-1-4244-2358-3
Type :
conf
DOI :
10.1109/CIT.2008.4594721
Filename :
4594721