• DocumentCode
    1583693
  • Title

    Classification of HTML documents by Hidden Tree-Markov Models

  • Author

    Diligenti, M. ; Gori, M. ; Maggini, M. ; Scarselli, F.

  • Author_Institution
    Dipt. di Ingegneria dell'Inf., Siena Univ., Italy
  • fYear
    2001
  • fDate
    2001
  • Firstpage
    849
  • Lastpage
    853
  • Abstract
    Content-based search and organization of Web documents pose new issues in information retrieval. We propose a novel approach for the classification of HTML documents based on a structured representation of their contents, which are split into logical contexts (paragraphs, sections, anchors, etc.). The classification is performed using Hidden Tree-Markov Models (HTMMs), an extension of Hidden Markov Models for processing structured objects. We report some promising experimental results showing that the use of the structured representation improves the classification accuracy in most cases.
  • Keywords
    content-based retrieval; document image processing; hidden Markov models; hypermedia markup languages; image classification; HTML documents; Hidden Tree-Markov Models; Web documents; content-based search; structured representation; Classification algorithms; Classification tree analysis; Content based retrieval; HTML; Hidden Markov models; Hydrogen; Information retrieval; Internet; Text categorization; Tree graphs;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Document Analysis and Recognition, 2001. Proceedings. Sixth International Conference on
  • Conference_Location
    Seattle, WA
  • Print_ISBN
    0-7695-1263-1
  • Type

    conf

  • DOI
    10.1109/ICDAR.2001.953907
  • Filename
    953907