  • DocumentCode
    2013326
  • Title
    Layout Based Information Extraction from HTML Documents
  • Author
    Burget, Radek
  • Author_Institution
    Brno Univ. of Technol., Brno
  • Volume
    2
  • fYear
    2007
  • fDate
    23-26 Sept. 2007
  • Firstpage
    624
  • Lastpage
    628
  • Abstract
    We propose a method of information extraction from HTML documents based on modelling the visual information in the document. A page segmentation algorithm is used for detecting the document layout and, subsequently, the extraction process is based on the analysis of mutual positions of the detected blocks and their visual features. This approach is more robust than the traditional DOM-based methods and it opens new possibilities for the extraction task specification.
  • Keywords
    document handling; hypermedia markup languages; information retrieval; HTML document; document layout detection; document visual information modelling; extraction task specification; layout based information extraction; page segmentation algorithm; visual feature; Algorithm design and analysis; Cascading style sheets; Data mining; HTML; Information analysis; Information technology; Page description languages; Robustness; Text analysis; Web sites;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Document Analysis and Recognition, 2007. ICDAR 2007. Ninth International Conference on
  • Conference_Location
    Parana
  • ISSN
    1520-5363
  • Print_ISBN
    978-0-7695-2822-9
  • Type
    conf
  • DOI
    10.1109/ICDAR.2007.4376990
  • Filename
    4376990