• DocumentCode
    2698219
  • Title

    Web informative content block detecting based on entropy and parent-child relationship in DOM

  • Author

    Ding, Yanhui ; Li, Qingzhong ; Yan, Zhongmin ; Dong, Yongquan

  • Author_Institution
    Sch. of Comput. Sci. & Technol., Shandong Univ., Jinan
  • fYear
    2008
  • fDate
    20-23 June 2008
  • Firstpage
    175
  • Lastpage
    178
  • Abstract
    To increase the commercial value and accessibility of pages, most sites tend to publish their pages with redundant information, such as navigation panels, advertisements, and copyright announcements. Such redundant information exists in almost all pages of a website, which increases the index size of general search engines and causes page topics to drift. In this paper, we propose an informative content block detection system called WICBDPCR (Web Informative Content Block Detecting based on Parent-Child Relationship in the document object model), which applies information theory to the DOM tree in order to detect the informative structure. Experiments on several real commercial Web sites show high precision and recall rates, which validate WICBDPCR's practical applicability.
  • Keywords
    Web sites; document handling; entropy; information retrieval; search engines; tree data structures; Web informative content block detection system; document object model tree; entropy method; information theory; parent-child relationship; search engine; Automation; Computer science; Data mining; Entropy; IEEE news; Information theory; Navigation; Object detection; Search engines; Web pages
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Information and Automation, 2008. ICIA 2008. International Conference on
  • Conference_Location
    Changsha
  • Print_ISBN
    978-1-4244-2183-1
  • Electronic_ISBN
    978-1-4244-2184-8
  • Type

    conf

  • DOI
    10.1109/ICINFA.2008.4607991
  • Filename
    4607991