• DocumentCode
    729468
  • Title
    Extracting news content with visual unit of web pages
  • Author
    Wenhao Zhu ; Song Dai ; Yang Song ; Zhiguo Lu
  • Author_Institution
    Sch. of Comput. Eng. & Sci., Shanghai Univ., Shanghai, China
  • fYear
    2015
  • fDate
    1-3 June 2015
  • Firstpage
    1
  • Lastpage
    5
  • Abstract
    The Document Object Model (DOM) provides a tree structure, called the DOM tree, for representing the objects in an HTML document. Many researchers have used the leaf nodes of the DOM tree as the basic objects when extracting information from web pages. However, web pages are better described as collections of information blocks, each with a consistent visual format, than as individual DOM tree nodes, and those information blocks do not necessarily map directly to DOM tree nodes. In this paper, we propose a visual-oriented extraction method that extracts news content by visual unit (vu, for short). Visual units are identified by a top-down approach based on visual features and text features. Page content is then extracted according to domain characteristics. In experiments, the proposed approach achieves 94.86% accuracy over 700 news web pages from 7 different news sites. The results demonstrate that our method is a promising approach to news content extraction with visual units and domain characteristics.
  • Keywords
    Internet; hypermedia markup languages; DOM tree nodes; HTML; Web pages; direct map; document object model; information blocks; information extraction; news content extraction; text features; visual features; visual oriented extraction method; visual unit; Accuracy; Data mining; Feature extraction; HTML; Visualization; Web pages; DOM; information extraction; visual unit;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing (SNPD), 2015 16th IEEE/ACIS International Conference on
  • Conference_Location
    Takamatsu
  • Type
    conf
  • DOI
    10.1109/SNPD.2015.7176203
  • Filename
    7176203