DocumentCode
1970421
Title
Web page DOM node characterization and its application to page segmentation
Author
Vineel, Gujjar
Author_Institution
GE Res., Comput. & Decision Sci. Lab., India
fYear
2009
fDate
9-11 Dec. 2009
Firstpage
1
Lastpage
6
Abstract
Web pages are generally organized in terms of visually distinct segments, such as Navigation bars, Advertisement banners, Headers, Portlets and Widgets. Despite this apparently structured layout, web pages are considered a source of unstructured data from an information extraction point of view. Hence, as a step towards interpreting the organization of web data, web page segmentation attempts to identify cohesive regions of a page. In this paper, we present a novel DOM tree mining approach for page segmentation. We first characterize the nodes of the DOM tree structure based on their Content Size and Entropy. While the Content Size of a node indicates the amount of textual content contributed by its subtree, Entropy measures the strength of local "patterns" exhibited therein. In other words, a node manifesting highly repetitive patterns begets a high Entropy as per our formulation. Based on this characterization of DOM nodes, we then develop an unsupervised algorithm to automatically identify the segments of a given web page.
Keywords
Internet; distributed object management; entropy; DOM tree mining approach; Web page DOM node characterization; advertisement banners; content size; entropy; headers; information extraction; navigation bars; page segmentation application; portlets; unstructured data; visually distinct segments; widgets; Bars; Data mining; Entropy; HTML; Navigation; Size measurement; Tree data structures; Tree graphs; Usability; Web pages; Document Object Model; Entropy; Web Information Extraction; Web Page Segmentation
fLanguage
English
Publisher
IEEE
Conference_Titel
Internet Multimedia Services Architecture and Applications (IMSAA), 2009 IEEE International Conference on
Conference_Location
Bangalore
Print_ISBN
978-1-4244-4792-3
Electronic_ISBN
978-1-4244-4793-0
Type
conf
DOI
10.1109/IMSAA.2009.5439444
Filename
5439444