Title :
Web Information Extraction Based on Clustering GHMM
Author :
Liu, Yongxin ; Liu, Zhijng
Author_Institution :
Sch. of Comput. Sci. & Technol., Xidian Univ., Xian
Abstract :
Web pages from different sources differ in form and style, so it is difficult to obtain an optimal model by learning from a mixed set of training pages. To improve the accuracy of information extraction, a new approach based on a clustering generalized hidden Markov model (GHMM) is proposed. In this approach, a clustering algorithm is applied to Web information extraction: the training pages are partitioned into a number of clusters using the simple agglomerative hierarchical K-Means clustering (SAHKC) algorithm, and a generalized hidden Markov model is trained on each cluster. Experimental results show that the new approach improves extraction performance effectively.
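The cluster-then-train idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: plain k-means stands in for SAHKC, an ordinary first-order HMM estimated by counting stands in for the GHMM, and the feature vectors, state labels, and function names (`kmeans`, `train_hmm`, `train_per_cluster`, `pick_model`) are all hypothetical. Decoding is omitted; the sketch only shows how one model per cluster is trained and how a new page would be routed to a model.

```python
import math
from collections import Counter, defaultdict


def kmeans(points, k, iters=20):
    """Plain k-means, seeded with the first k points (a stand-in for SAHKC)."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign each point to its nearest centroid.
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(xs) / len(members) for xs in zip(*members)]
    return centroids, labels


def train_hmm(sequences):
    """Count-based first-order HMM from labelled (token, state) sequences."""
    trans, emit = defaultdict(Counter), defaultdict(Counter)
    for seq in sequences:
        prev = "<s>"  # assumed start-of-page marker
        for token, state in seq:
            trans[prev][state] += 1
            emit[state][token] += 1
            prev = state

    def normalise(table):
        return {s: {key: n / sum(row.values()) for key, n in row.items()}
                for s, row in table.items()}

    return {"trans": normalise(trans), "emit": normalise(emit)}


def train_per_cluster(features, labelled_pages, k):
    """Cluster the pages by feature vector, then train one model per cluster."""
    centroids, labels = kmeans(features, k)
    groups = defaultdict(list)
    for page, lab in zip(labelled_pages, labels):
        groups[lab].append(page)
    return centroids, {c: train_hmm(pages) for c, pages in groups.items()}


def pick_model(feature, centroids, models):
    """Route a new page to the model of its nearest cluster centroid."""
    nearest = min(models, key=lambda c: math.dist(feature, centroids[c]))
    return nearest, models[nearest]


# Toy data (hypothetical): two layout styles with opposite field order.
features = [(0.9, 0.1), (0.8, 0.2), (0.1, 0.9), (0.2, 0.8)]
pages = [
    [("Title", "title"), ("Smith", "author")],
    [("Paper", "title"), ("Jones", "author")],
    [("Lee", "author"), ("Report", "title")],
    [("Kim", "author"), ("Notes", "title")],
]
centroids, models = train_per_cluster(features, pages, k=2)
cluster_id, model = pick_model((0.7, 0.3), centroids, models)
```

Because each model sees only pages of one style, its transition estimates are not blurred by the opposite field order found in the other cluster, which is the motivation the abstract gives for clustering before training.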
Keywords :
Web sites; hidden Markov models; information analysis; learning (artificial intelligence); GHMM; Web pages; Web information extraction; generalized hidden Markov model; learning; simple agglomerative hierarchical K-Means clustering; Clustering algorithms; Collaboration; Computational intelligence; Computer science; Data mining; Explosives; Internet; K-Means; hidden Markov model
Conference_Title :
Computational Intelligence and Design, 2008. ISCID '08. International Symposium on
Conference_Location :
Wuhan
Print_ISBN :
978-0-7695-3311-7
DOI :
10.1109/ISCID.2008.189