• DocumentCode
    2718535
  • Title
    Effectively and efficiently detect web page duplication
  • Author
    Han, Zhongming ; Mo, Qian ; Liu, Hongzhi ; Sun, Jianzhi
  • Author_Institution
    Sch. of Comput. Sci. & Inf. Eng., Beijing Technol. & Bus. Univ., Beijing, China
  • fYear
    2009
  • fDate
    1-4 Nov. 2009
  • Firstpage
    1
  • Lastpage
    6
  • Abstract
    There are many redundant Web pages on the Internet. Based on tag statistics and text similarity comparison, we present a novel multilayer framework for detecting duplicated Web pages. We propose two algorithms for detecting similar text paragraphs and implement the framework. The experimental results show that our approach achieves high performance, which means that duplicated Web pages can be detected efficiently using only tag statistics and text comparison.
  • Keywords
    Internet; Web sites; text analysis; Web page duplication; similarity text paragraphs detection algorithms; tag statistic; text similarity comparison; Web pages
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Digital Information Management, 2009. ICDIM 2009. Fourth International Conference on
  • Conference_Location
    Ann Arbor, MI
  • Print_ISBN
    978-1-4244-4253-9
  • Electronic_ISBN
    978-1-4244-4254-6
  • Type
    conf
  • DOI
    10.1109/ICDIM.2009.5356801
  • Filename
    5356801