• DocumentCode
    3300285
  • Title
    Semantic keywords-based duplicated web pages removing
  • Author
    Weng, Yunhe ; Li, Lei ; Zhong, Yixin
  • Author_Institution
    Sch. of Inf., Beijing Univ. of Posts & Telecommun., Beijing
  • fYear
    2008
  • fDate
    19-22 Oct. 2008
  • Firstpage
    1
  • Lastpage
    7
  • Abstract
    Because many duplicated web pages exist on the web, search engines need to find and remove them, not only to save processing time and hardware resources, but also to ensure that users get search results without many replicas. In this paper, we propose a method to find and remove duplicated Chinese Web pages for search engines. First we describe a scheme based on semantic keywords combined with sentence overlapping, and then present an implemented prototype, with experimental results suggesting that the prototype works well under a proper setting.
  • Keywords
    Web sites; natural language processing; search engines; Chinese Web pages; duplicated Web pages; semantic keywords; Data engineering; Information retrieval; Libraries; Natural languages; Optical computing; Performance evaluation; Relational databases; Spatial databases; Testing; Web pages; IR
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Natural Language Processing and Knowledge Engineering, 2008. NLP-KE '08. International Conference on
  • Conference_Location
    Beijing
  • Print_ISBN
    978-1-4244-4515-8
  • Electronic_ISBN
    978-1-4244-2780-2
  • Type
    conf
  • DOI
    10.1109/NLPKE.2008.4906751
  • Filename
    4906751