• DocumentCode
    2450578
  • Title

    A Web Page De-duplication Algorithm Based on Data Clearing

  • Author

    Lin, Jian-ming ; Liu, Dong-sheng ; Gao, Shi-wen ; Chen, Wei

  • Author_Institution
    Sch. of Bus. Adm., Zhejiang Gongshang Univ., Hangzhou, China
  • fYear
    2009
  • fDate
    25-26 April 2009
  • Firstpage
    544
  • Lastpage
    547
  • Abstract
    Duplicated web pages returned by search engines not only waste valuable storage but also add to users' browsing burden. Web page de-duplication can effectively improve information retrieval. This paper proposes preprocessing web pages, following the principle of data clearing, to improve the effectiveness and efficiency of feature-code-based web page de-duplication. The method ranks feature codes to reduce the number of comparisons the system performs, as well as its space and time complexity. Experiments show that this method is promising for eliminating large-scale duplicated web pages.
  • Keywords
    Internet; Web sites; information retrieval; search engines; Web page deduplication; data clearing; Cleaning; Computer science; Data engineering; Data mining; Educational institutions; Feature extraction; Web pages; data cleaning; feature codes; reshipment statement; web page de-duplication
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Artificial Intelligence, 2009. JCAI '09. International Joint Conference on
  • Conference_Location
    Hainan Island
  • Print_ISBN
    978-0-7695-3615-6
  • Type

    conf

  • DOI
    10.1109/JCAI.2009.181
  • Filename
    5159062