DocumentCode :
2550799
Title :
A method of automatic web information extraction based on page clustering
Author :
Yang, Tianqi ; Qiu, Taofen
Author_Institution :
Dept. of Comput. Sci., Jinan Univ., Guangzhou, China
fYear :
2011
fDate :
21-25 June 2011
Firstpage :
390
Lastpage :
393
Abstract :
Dynamic web sites have a large number of pages, high-value data, and a highly modular structure. Based on these features, this paper develops an automatic web information extraction system based on page clustering. Building on the DOM extraction technique, it uses page clustering to find clusters of highly similar pages, and improves the accuracy of the clustering results by using a column similarity measure and a global auto-similarity measure. The extraction template uses optional nodes to modify and adjust the template in order to improve the identification of content nodes. Experimental results show that this method automatically locates and extracts the main information of pages and achieves high precision and recall.
Keywords :
Web sites; content management; information retrieval; pattern clustering; DOM extraction; automatic Web information extraction; column similarity measure; content node; dynamic Web page; extraction template; global auto-similarity measure; high-modularity structure; high-value data; page clustering; similarity cluster; Binary codes; Data mining; Feature extraction; HTML; Knowledge engineering; Web pages; XML; page clustering; web information extraction; wrapper generation;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Intelligent Control and Automation (WCICA), 2011 9th World Congress on
Conference_Location :
Taipei
Print_ISBN :
978-1-61284-698-9
Type :
conf
DOI :
10.1109/WCICA.2011.5970541
Filename :
5970541