Title :
Precise web page segmentation based on semantic block headers detection
Author :
Zhang, Aihua ; Jing, Jiwu ; Kang, Le ; Zhang, Lingchen
Author_Institution :
Dept. of Electron. Eng. & Inf. Sci., Univ. of Sci. & Technol. of China, Hefei, China
Abstract :
Web page segmentation is an important technology for web-driven applications such as search engines and web browsers on mobile devices. Current research in this field attempts to mine features of visual presentation and document structure, but it is difficult to choose features that yield a precise result. Approaches that rely solely on either vision-based methods or DOM structure analysis have shortcomings and are not satisfactory in practice. This paper presents a novel algorithm for web page segmentation. By extracting block headers, the algorithm partitions a web page into semantic blocks. The algorithm exploits both the visual and the structural features of a web page from a simple but novel perspective. We apply the algorithm to a set of real-world web pages for verification and obtain very positive results.
Keywords :
Web sites; mobile computing; online front-ends; search engines; DOM structure analysis; Web browser; Web page segmentation; Web-driven applications; document object model; document structure; mobile device; search engine; semantic block headers detection; vision-based method; visual presentation; block header; block node; content row; web page segmentation;
Conference_Title :
Digital Content, Multimedia Technology and its Applications (IDC), 2010 6th International Conference on
Conference_Location :
Seoul
Print_ISBN :
978-1-4244-7607-7
Electronic_ISBN :
978-8-9886-7827-5