DocumentCode :
751916
Title :
ViDE: A Vision-Based Approach for Deep Web Data Extraction
Author :
Liu, Wei ; Meng, Xiaofeng ; Meng, Weiyi
Author_Institution :
Sch. of Inf., Renmin Univ. of China, Beijing, China
Volume :
22
Issue :
3
fYear :
2010
fDate :
3/1/2010 12:00:00 AM
Firstpage :
447
Lastpage :
460
Abstract :
Deep Web contents are accessed by queries submitted to Web databases, and the returned data records are wrapped in dynamically generated Web pages (called deep Web pages in this paper). Extracting structured data from deep Web pages is a challenging problem because of the intricate underlying structures of such pages. A large number of techniques have been proposed to address this problem, but all of them have inherent limitations because they are Web-page-programming-language-dependent. As a popular two-dimensional medium, Web pages always display their contents in a regular layout for users to browse. This motivates us to seek a different way of extracting deep Web data that overcomes the limitations of previous work by exploiting common visual features of deep Web pages. In this paper, a novel vision-based approach that is Web-page-programming-language-independent is proposed. This approach primarily uses the visual features of deep Web pages to implement deep Web data extraction, including data record extraction and data item extraction. We also propose a new evaluation measure, revision, to capture the amount of human effort needed to produce perfect extraction. Our experiments on a large set of Web databases show that the proposed vision-based approach is highly effective for deep Web data extraction.
Keywords :
Internet; data structures; query processing; ViDE; Web databases; Web pages; Web-page-programming-language-independent feature; data item extraction; data record extraction; deep web data extraction; structure data; vision-based approach; Web data extraction; Web mining; visual features of deep Web pages; wrapper generation
fLanguage :
English
Journal_Title :
Knowledge and Data Engineering, IEEE Transactions on
Publisher :
IEEE
ISSN :
1041-4347
Type :
jour
DOI :
10.1109/TKDE.2009.109
Filename :
4840351