• DocumentCode
    1635814
  • Title

    A priority-based method of near-duplicated text information of web pages deletion

  • Author

    Ling, Yun ; Tao, Xiaobo ; Lv, Hexin

  • Author_Institution
    Coll. of Comput. Sci. & Inf. Eng., Zhejiang Gongshang Univ., Hangzhou, China
  • fYear
    2010
  • Firstpage
    495
  • Lastpage
    499
  • Abstract
    Duplicated web pages returned by a search engine not only waste storage resources but also increase the burden on web users. To address the near-duplication seen in professional web pages, such as those in the employment field, a new priority-based method for detecting and deleting near-duplicated web pages based on their text information is proposed. The method implements an algorithm that extracts the text information of web pages through the DOM tree and a priority-based algorithm that detects near-duplicated text information, thereby reducing web page noise and improving the efficiency of near-duplicate detection. The experimental results indicate that both completely and partly duplicated web pages are detected accurately.
  • Keywords
    Internet; text analysis; Web page deletion; near-duplicated text information; priority-based method; Algorithm design and analysis; Containers; Data mining; Employment; HTML; Noise; Web pages; DOM tree; detect and delete near-duplicated web pages; information extraction; search engine;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Software Engineering and Service Sciences (ICSESS), 2010 IEEE International Conference on
  • Conference_Location
    Beijing
  • Print_ISBN
    978-1-4244-6054-0
  • Type

    conf

  • DOI
    10.1109/ICSESS.2010.5552319
  • Filename
    5552319