DocumentCode
2022470
Title
Web Information Extraction Based on Clustering GHMM
Author
Liu, Yongxin ; Liu, Zhijing
Author_Institution
Sch. of Comput. Sci. & Technol., Xidian Univ., Xian
Volume
1
fYear
2008
fDate
17-18 Oct. 2008
Firstpage
545
Lastpage
548
Abstract
Web pages from different network sources differ in form and style, so it is difficult to obtain an optimal model by learning from hybrid training pages. To improve the accuracy of information extraction, a new approach based on a clustering generalized hidden Markov model (GHMM) is proposed. In this approach, a clustering algorithm is applied to web information extraction: the training pages are partitioned into a number of clusters using the simple agglomerative hierarchical K-Means clustering (SAHKC) algorithm, and a generalized hidden Markov model is trained on each cluster. Experimental results show that the new approach effectively improves extraction performance.
Keywords
Web sites; hidden Markov models; information analysis; learning (artificial intelligence); GHMM; Web Pages; Web information extraction; generalized hidden Markov model; learning; simple agglomerative hierarchical K-Means clustering; Clustering algorithms; Collaboration; Computational intelligence; Computer science; Data mining; Explosives; Hidden Markov models; Internet; Web pages; Web sites; K-Means; Web Information Extraction; hidden Markov model;
fLanguage
English
Publisher
ieee
Conference_Titel
Computational Intelligence and Design, 2008. ISCID '08. International Symposium on
Conference_Location
Wuhan
Print_ISBN
978-0-7695-3311-7
Type
conf
DOI
10.1109/ISCID.2008.189
Filename
4725669