Regional Information Center for Science and Technology - A method of automatic web information extraction based on page clustering

DocumentCode :

2550799

Title :

A method of automatic web information extraction based on page clustering

Author :

Yang, Tianqi ; Qiu, Taofen

Author_Institution :

Dept. of Comput. Sci., Jinan Univ., Guangzhou, China

fYear :

2011

fDate :

21-25 June 2011

Firstpage :

390

Lastpage :

393

Abstract :

Dynamic web pages are numerous, carry high-value data, and have a highly modular structure. Based on these features, this paper develops an automatic web information extraction system based on page clustering. Building on the DOM extraction technique, it uses page clustering to find clusters of highly similar pages, and improves the accuracy of the clustering results by using a column similarity measure and a global auto-similarity measure. The extraction template uses optional nodes to modify and adjust the template in order to improve the identification of content nodes. Experimental results show that this method automatically locates and extracts the main information of pages and achieves high precision and recall.
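
The general idea of grouping pages by structural similarity before building an extraction template can be illustrated with a minimal sketch. This is not the authors' algorithm: the abstract does not define the column similarity measure or the global auto-similarity measure, so the sketch swaps in a plain Jaccard similarity over DOM tag paths and a greedy single-pass clustering. The names `tag_paths`, `jaccard`, `cluster_pages`, and the `threshold` value are all assumptions made for illustration.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TagPathCollector(HTMLParser):
    """Collects the set of root-to-node tag paths, e.g. 'html/body/div/p'."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = set()

    def handle_starttag(self, tag, attrs):
        self.paths.add("/".join(self.stack + [tag]))
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Tolerate sloppy HTML: unwind the stack up to the matching open tag.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass


def tag_paths(html):
    """Return the set of DOM tag paths found in an HTML string."""
    collector = TagPathCollector()
    collector.feed(html)
    collector.close()
    return collector.paths


def jaccard(a, b):
    """Jaccard similarity of two sets (assumed stand-in for the paper's measures)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster_pages(pages, threshold=0.8):
    """Greedy clustering: each page joins the first cluster whose
    representative (its first member) is at least `threshold` similar."""
    clusters = []  # list of (representative_paths, member_indices)
    for index, html in enumerate(pages):
        paths = tag_paths(html)
        for representative, members in clusters:
            if jaccard(representative, paths) >= threshold:
                members.append(index)
                break
        else:
            clusters.append((paths, [index]))
    return [members for _, members in clusters]
```

Given two pages that share a `div/h1/p` layout and one table-based page, `cluster_pages` would place the first two in one cluster and the third in its own; a template could then be derived per cluster.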

Keywords :

Web sites; content management; information retrieval; pattern clustering; DOM extraction; automatic Web information extraction; column similarity measure; content node; dynamic Web page; extraction template; global auto-similarity measure; high-modularity structure; high-value data; page clustering; similarity cluster; Binary codes; Data mining; Feature extraction; HTML; Knowledge engineering; Web pages; XML; page clustering; web information extraction; wrapper generation;

fLanguage :

English

Publisher :

IEEE

Conference_Title :

Intelligent Control and Automation (WCICA), 2011 9th World Congress on

Conference_Location :

Taipei

Print_ISBN :

978-1-61284-698-9

Type :

conf

DOI :

10.1109/WCICA.2011.5970541

Filename :

5970541

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=2550799