  • DocumentCode
    2012084
  • Title
    Web Document Analysis Based on Visual Segmentation and Page Rendering
  • Author
    Cong Kinh Nguyen ; Likforman-Sulem, Laurence ; Moissinac, Jean-Claude ; Faure, Claudie ; Lardon, Jérémy
  • Author_Institution
    Telecom-ParisTech, Paris, France
  • fYear
    2012
  • fDate
    27-29 March 2012
  • Firstpage
    354
  • Lastpage
    358
  • Abstract
    This paper proposes an approach for segmenting a Web page into its semantic parts. Such analysis may be useful for adapting blogs and other pages to small devices. The approach takes advantage of both the dynamic layout obtained after rendering and textual information. The method first segments the page into blocks and then classifies the blocks into semantic parts with an SVM-based machine learning approach that uses a set of 30 textual and visual features. Evaluation is conducted on a Web blog database. Results are provided for both block classification and blog segmentation into articles.
  • Keywords
    Web sites; document handling; learning (artificial intelligence); pattern classification; rendering (computer graphics); support vector machines; SVM-based machine learning approach; Web blog database; Web document analysis; Web page segmentation; block classification; block segmentation; dynamic layout; page rendering; textual features; textual information; visual segmentation; visual-based features; Conferences; Text analysis; Internet document; Web page segmentation; block segmentation; semantic block;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Document Analysis Systems (DAS), 2012 10th IAPR International Workshop on
  • Conference_Location
    Gold Coast, QLD
  • Print_ISBN
    978-1-4673-0868-7
  • Type
    conf
  • DOI
    10.1109/DAS.2012.95
  • Filename
    6195393