Title :
Block-o-Matic: A web page segmentation framework
Author :
Sanoja, Andres ; Gancarski, Stephane
Author_Institution :
LIP6, UPMC, Paris, France
Abstract :
In this paper we describe Block-o-Matic, a web page segmentation framework. It is a hybrid approach inspired by automated document processing methods and visual-based content segmentation techniques. A web page is associated with three structures: the DOM tree, the content structure and the logical structure. The DOM tree represents the HTML elements of a page; the content structure organizes page objects according to their content categories and geometry; and the logical structure is the result of mapping the content structure on the basis of the human-perceptible meaning that makes up the blocks. The logical structure represents the final segmentation. The segmentation process is divided into three phases: analysis, understanding and reconstruction of a web page. An evaluation method for web page segmentations is proposed, based on a ground truth of 400 pages classified into 16 categories. Block-o-Matic gives promising results.
Keywords :
Internet; document handling; Block-o-Matic framework; DOM tree; Web page analysis phase; Web page reconstruction phase; Web page segmentation framework; Web page understanding phase; content structure; document processing method; human-perceptible meaning; logical structure; visual-based content segmentation techniques; Bills of materials; Geometry; HTML; Image segmentation; Layout; Visualization; Web pages; correctness; page segmentation; web pages
Conference_Title :
Multimedia Computing and Systems (ICMCS), 2014 International Conference on
Conference_Location :
Marrakech
Print_ISBN :
978-1-4799-3823-0
DOI :
10.1109/ICMCS.2014.6911249