DocumentCode :
658650
Title :
Vision-Based Web Page Block Segmentation and Informative Block Detection
Author :
Xuhong Zhang ; Yanqing Zhang ; Jing He ; Cobia, Frank
Author_Institution :
Dept. of Comput. Sci., Georgia State Univ., Atlanta, GA, USA
Volume :
3
fYear :
2013
fDate :
17-20 Nov. 2013
Firstpage :
265
Lastpage :
269
Abstract :
A web page usually contains various content, such as a main content block, navigation bars or columns, contact information at the bottom, advertisements, or purely decorative components. Apart from the main content block, these parts are unrelated to the topic of the web page; we call them noise blocks. Content in noise blocks seriously harms information extraction, web mining, web searching, etc. Identifying the main content block is therefore a key issue. In this paper, we propose a new low-cost vision-based web page segmentation and informative block detection algorithm. We then apply this algorithm to develop a bids update detection system for a local company that collects bids. Our algorithm mainly utilizes the position information of HTML elements, which lowers the cost compared with applying multiple rules as in VIPS. Our proposed row-column splitting indicator provides an easy-to-use partition granularity value, which resolves the difficulty of choosing an appropriate Degree of Coherence (DoC) value in the VIPS algorithm. This row-column splitting indicator also avoids the traditional high-cost clustering process in detecting informative blocks. Through extensive experiments, we conclude that our proposed algorithm performs comparably to VIPS and other informative block detection algorithms, but with an easier-to-use granularity value and lower computation cost.
Keywords :
Internet; computer vision; image segmentation; object detection; DoC value; HTML element position information; VIPS algorithm; advertisements; bids update detection system; clustering process; contacts information; content block identification; decoration components; degree-of-coherence; informative block detection algorithm; navigation bars; navigation columns; noise blocks; partition granularity value; row-column splitting indicator; vision-based Web page block segmentation; Companies; Detection algorithms; HTML; Noise; Partitioning algorithms; Visualization; Web pages; Bids; Informative Block; Segmentation; VIPS;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Web Intelligence (WI) and Intelligent Agent Technologies (IAT), 2013 IEEE/WIC/ACM International Joint Conferences on
Conference_Location :
Atlanta, GA
Print_ISBN :
978-1-4799-2902-3
Type :
conf
DOI :
10.1109/WI-IAT.2013.194
Filename :
6690739