• DocumentCode
    2022470
  • Title

    Web Information Extraction Based on Clustering GHMM

  • Author

    Liu, Yongxin ; Liu, Zhijing

  • Author_Institution
    Sch. of Comput. Sci. & Technol., Xidian Univ., Xian
  • Volume
    1
  • fYear
    2008
  • fDate
    17-18 Oct. 2008
  • Firstpage
    545
  • Lastpage
    548
  • Abstract
    Web pages from different sources differ in form and style, so it is difficult to obtain an optimal model by learning from a mixed set of training pages. To improve the accuracy of information extraction, a new approach based on a clustering generalized hidden Markov model (GHMM) is proposed. In this approach, a clustering algorithm is applied to web information extraction: the training pages are partitioned into a number of clusters using the simple agglomerative hierarchical K-Means clustering (SAHKC) algorithm, and a generalized hidden Markov model is trained on each cluster. Experimental results show that the new approach effectively improves extraction performance.
  • Keywords
    Web sites; hidden Markov models; information analysis; learning (artificial intelligence); GHMM; Web Pages; Web information extraction; generalized hidden Markov model; learning; simple agglomerative hierarchical K-Means clustering; Clustering algorithms; Collaboration; Computational intelligence; Computer science; Data mining; Explosives; Hidden Markov models; Internet; Web pages; Web sites; K-Means; Web Information Extraction; hidden Markov model;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Computational Intelligence and Design, 2008. ISCID '08. International Symposium on
  • Conference_Location
    Wuhan
  • Print_ISBN
    978-0-7695-3311-7
  • Type

    conf

  • DOI
    10.1109/ISCID.2008.189
  • Filename
    4725669