Semantic keywords-based duplicated web pages removing

Author

Weng, Yunhe ; Li, Lei ; Zhong, Yixin

Author_Institution

Sch. of Inf., Beijing Univ. of Posts & Tele-Commun., Beijing

fYear

2008

fDate

19-22 Oct. 2008

Firstpage

Lastpage

Abstract

Because many duplicated web pages exist on the web, search engines need to find and remove them, both to save processing time and hardware resources and to ensure that users get results without many replicas. In this paper, we propose a method for finding and removing duplicated Chinese Web pages for search engines. We first describe a scheme based on semantic keywords combined with sentence overlap, and then present an implemented prototype, with experimental results suggesting that the prototype works well under a proper setting.

Keywords

Web sites; natural language processing; search engines; Chinese Web pages; duplicated Web pages; semantic keywords; Data engineering; Information retrieval; Libraries; Natural languages; Optical computing; Performance evaluation; Relational databases; Spatial databases; Testing; Web pages; IR

fLanguage

English

Publisher

IEEE

Conference_Titel

Natural Language Processing and Knowledge Engineering, 2008. NLP-KE '08. International Conference on

Conference_Location

Beijing

Print_ISBN

978-1-4244-4515-8

Electronic_ISBN

978-1-4244-2780-2

Type

conf

DOI

10.1109/NLPKE.2008.4906751

Filename

4906751

Link To Document

https://search.isc.ac/dl/search/defaultta.aspx?DTC=49&DC=3300285