Web Information Extraction Based on Clustering GHMM

Author

Liu, Yongxin ; Liu, Zhijing

Author_Institution

Sch. of Comput. Sci. & Technol., Xidian Univ., Xi'an

Volume

1

fYear

2008

fDate

17-18 Oct. 2008

Firstpage

545

Lastpage

548

Abstract

Web pages from different sources vary in form and style, so it is difficult to obtain an optimal model by learning from a mixed set of training pages. To improve the accuracy of information extraction, a new approach based on a clustering generalized hidden Markov model (GHMM) is proposed. In this approach, a clustering algorithm is applied to web information extraction: the training pages are partitioned into a number of clusters using the simple agglomerative hierarchical K-Means clustering (SAHKC) algorithm, and a generalized hidden Markov model is trained on each cluster. Experimental results show that the new approach effectively improves extraction performance.

Keywords

Web sites; hidden Markov models; information analysis; learning (artificial intelligence); GHMM; Web pages; Web information extraction; generalized hidden Markov model; learning; simple agglomerative hierarchical K-Means clustering; Clustering algorithms; Collaboration; Computational intelligence; Computer science; Data mining; Explosives; Internet; K-Means; hidden Markov model

fLanguage

English

Publisher

IEEE

Conference_Titel

Computational Intelligence and Design, 2008. ISCID '08. International Symposium on

Conference_Location

Wuhan

Print_ISBN

978-0-7695-3311-7

Type

conf

DOI

10.1109/ISCID.2008.189

Filename

4725669