• DocumentCode
    3230385
  • Title
    A Generalized Hidden Markov Model Approach for Web Information Extraction
  • Author
    Zhong, Ping; Chen, Jinlin
  • Author_Institution
    Dept. of Comput. Sci., City Univ. of New York, NY
  • fYear
    2006
  • fDate
    18-22 Dec. 2006
  • Firstpage
    709
  • Lastpage
    718
  • Abstract
    This paper presents a generalized hidden Markov model (GHMM) that extends traditional HMMs by using Web-specific information for Web information extraction. Web content blocks, rather than content terms, serve as the basic extraction unit. In addition, instead of the traditional sequential state transition order, the state transition orders of GHMMs are detected from the layout structures of the corresponding Web pages. Furthermore, multiple emission features are applied instead of a single emission feature. In this way, GHMMs can better accommodate Web information extraction. Experiments show promising results for GHMMs.
  • Keywords
    Internet; hidden Markov models; information retrieval; Web content blocks; Web information extraction; generalized hidden Markov model approach; multiple emission features; state transition orders; Computer science; Data mining; Educational institutions; Hidden Markov models; Intelligent structures; Learning systems; Parameter estimation; State estimation; Stochastic processes; Web pages
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Web Intelligence, 2006. WI 2006. IEEE/WIC/ACM International Conference on
  • Conference_Location
    Hong Kong
  • Print_ISBN
    0-7695-2747-7
  • Type
    conf
  • DOI
    10.1109/WI.2006.13
  • Filename
    4061457