• DocumentCode
    1970421
  • Title
    Web page DOM node characterization and its application to page segmentation
  • Author
    Vineel, Gujjar
  • Author_Institution
    GE Res., Comput. & Decision Sci. Lab., India
  • fYear
    2009
  • fDate
    9-11 Dec. 2009
  • Firstpage
    1
  • Lastpage
    6
  • Abstract
    Web pages are generally organized in terms of visually distinct segments, such as Navigation bars, Advertisement banners, Headers, Portlets and Widgets. Despite the apparently structured layout, web pages are considered a source of unstructured data from an information extraction point of view. Hence, as a step towards interpreting the organization of web data, web page segmentation attempts to identify cohesive regions of a page. In this paper, we present a novel DOM tree mining approach for page segmentation. We first characterize the nodes of the DOM tree structure based on their Content Size and Entropy. While the Content Size of a node indicates the amount of textual content contributed by its subtree, Entropy measures the strength of local "patterns" exhibited therein. In other words, a node manifesting highly repetitive patterns yields a high Entropy as per our formulation. Based on this characterization of DOM nodes, we then develop an unsupervised algorithm to automatically identify segments of a given web page.
  • Keywords
    Internet; distributed object management; entropy; DOM tree mining approach; Web page DOM node characterization; advertisement banners; content size; entropy; headers; information extraction; navigation bars; page segmentation application; portlets; unstructured data; visually distinct segments; widgets; Bars; Data mining; Entropy; HTML; Navigation; Size measurement; Tree data structures; Tree graphs; Usability; Web pages; Document Object Model; Entropy; Web Information Extraction; Web Page Segmentation;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Internet Multimedia Services Architecture and Applications (IMSAA), 2009 IEEE International Conference on
  • Conference_Location
    Bangalore
  • Print_ISBN
    978-1-4244-4792-3
  • Electronic_ISBN
    978-1-4244-4793-0
  • Type
    conf
  • DOI
    10.1109/IMSAA.2009.5439444
  • Filename
    5439444