Regional Information Center for Science and Technology - A new method on the detection of near-replicas of web pages

DocumentCode :

2508429

Title :

A new method on the detection of near-replicas of web pages

Author :

Jia-heng Zheng ; Li-xia Wei ; Hong-ye Tan

Author_Institution :

Dept. of Comput. & Inf. Technol., Shanxi Univ., Taiyuan

fYear :

2008

fDate :

8-11 July 2008

Firstpage :

473

Lastpage :

478

Abstract :

Near-replicas of web pages seriously reduce the efficiency of search engines (SEs). In this paper, we present a new method to detect near-replicas of web pages. First, the styles of text structures in web pages are analyzed and classified. Then, according to the style of the text, different methods are used to obtain the text structure, which is represented as a matrix. Finally, the similarity is calculated by dynamically extracting features from the matrix. Experiments show that this method not only improves computing efficiency but also ensures high precision and recall.
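
The abstract gives no implementation details, so the following is only a minimal illustrative sketch of the general idea of comparing pages by text structure rather than by content. Every choice here is an assumption and not the authors' method: the names `structure_matrix` and `structure_similarity` are hypothetical, paragraphs are assumed to be separated by blank lines, each matrix row is taken to be the word counts of one paragraph's sentences, and Python's `difflib.SequenceMatcher` stands in for the paper's dynamic feature extraction.

```python
import re
from difflib import SequenceMatcher


def structure_matrix(text):
    """Return a ragged matrix: one row per paragraph, one entry per sentence.

    Each entry is the number of words in that sentence (an assumed feature).
    """
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    rows = []
    for paragraph in paragraphs:
        sentences = [s for s in re.split(r"[.!?]+", paragraph) if s.strip()]
        rows.append(tuple(len(s.split()) for s in sentences))
    return rows


def structure_similarity(text_a, text_b):
    """Return a score in [0, 1] based on matching rows of the two matrices."""
    matcher = SequenceMatcher(
        None, structure_matrix(text_a), structure_matrix(text_b), autojunk=False
    )
    return matcher.ratio()


page_a = "One two three. Four five.\n\nSix seven eight nine."
page_b = "Red green blue. Cat dog.\n\nNorth south east west."
page_c = "A.\n\nB c. D e f.\n\nG."

print(structure_similarity(page_a, page_b))  # same structure, different words
print(structure_similarity(page_a, page_c))  # different structure
```

In this sketch, two pages with identical paragraph and sentence layouts score as fully similar even when their words differ, which is why a practical detector would combine structural features with some content-based check.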

Keywords :

Internet; classification; text analysis; Web pages near-replicas; text structure analysis; text structure classification; Blogs; Data mining; Feature extraction; HTML; Indexing; Information analysis; Information technology; Navigation; Search engines; Web pages;

fLanguage :

English

Publisher :

IEEE

Conference_Title :

Computer and Information Technology, 2008. CIT 2008. 8th IEEE International Conference on

Conference_Location :

Sydney, NSW

Print_ISBN :

978-1-4244-2357-6

Electronic_ISBN :

978-1-4244-2358-3

Type :

conf

DOI :

10.1109/CIT.2008.4594721

Filename :

4594721

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=2508429