• DocumentCode
    3025613
  • Title
    The Research of Web Page De-duplication Based on Web Pages Reshipment Statement
  • Author
    Wang, Min-Yan ; Liu, Dong-sheng
  • Author_Institution
    Coll. of Comput. Sci. & Inf. Eng., Zhejiang Gongshang Univ., Hangzhou, China
  • fYear
    2009
  • fDate
    25-26 April 2009
  • Firstpage
    271
  • Lastpage
    274
  • Abstract
    The Web page de-duplication module is an important part of a search engine system. It improves the engine's performance and quality by filtering the Web pages downloaded by the crawler and eliminating duplicates. Starting from the source of duplicated Web pages, namely reshipment, this paper proposes a de-duplication method that extracts information including the original Web site and the Web title, and uses it to eliminate duplicated Web pages based on feature codes. Experiments show that the method achieves satisfactory results in eliminating duplicated Web pages at large scale.
  • Keywords
    Web sites; information filtering; search engines; Web page deduplication; Web page filtering; Web page reshipment statement; Web site; Web title; crawler system; search engine system; Application software; Computer science; Databases; Educational institutions; Information filtering; Information filters; Large-scale systems; Search engines; Uniform resource locators; Web pages; URL; feature codes; reshipment statement; web page de-duplication
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Database Technology and Applications, 2009 First International Workshop on
  • Conference_Location
    Wuhan, Hubei
  • Print_ISBN
    978-0-7695-3604-0
  • Type
    conf
  • DOI
    10.1109/DBTA.2009.64
  • Filename
    5207762