DocumentCode
2489224
Title
Detecting Informative Web Page Blocks for Efficient Information Extraction Using Visual Block Segmentation
Author
Kang, Jinbeom ; Choi, Joongmin
Author_Institution
Hanyang Univ., Ansan
fYear
2007
fDate
23-24 Nov. 2007
Firstpage
306
Lastpage
310
Abstract
As the structure of Web pages grows more complicated, constructing wrapper induction rules becomes more difficult and time-consuming. The main problem in most wrapper induction methods is the difficulty of discriminating the meaningful blocks that contain the target information from the noise blocks that contain irrelevant information such as advertisements, menus, or copyright statements. To solve this problem, this paper proposes the RIPB (recognizing informative page blocks) algorithm, which detects the informative blocks in a Web page by exploiting the visual block segmentation scheme. RIPB uses the visual page segmentation algorithm to analyze and partition a Web page into a set of logical blocks, groups related blocks with similar structures into block clusters, and then recognizes the informative block clusters by applying heuristic rules to the cluster information. The results of a series of experiments indicate that RIPB helps improve the accuracy of information extraction by allowing the wrapper induction module to focus only on the informative blocks and ignore noise when building extraction rules.
Keywords
Web sites; information retrieval; learning (artificial intelligence); Web page; cluster information; heuristic rules; information extraction; machine learning; recognizing informative page blocks; visual block segmentation; wrapper induction; Clustering algorithms; Computer science; Data mining; Information analysis; Information technology; Partitioning algorithms; Supervised learning; Target recognition; Training data; Web pages
fLanguage
English
Publisher
IEEE
Conference_Titel
Information Technology Convergence, 2007. ISITC 2007. International Symposium on
Conference_Location
Jeonju
Print_ISBN
0-7695-3045-1
Electronic_ISBN
978-0-7695-3045-1
Type
conf
DOI
10.1109/ISITC.2007.6
Filename
4410655