• DocumentCode
    2492010
  • Title

    Building Web Page Logical Structure Model towards Effective Metadata Extraction

  • Author

    Zhou, Baoyao ; Zhang, Ming

  • Author_Institution
    Hewlett-Packard Labs. China, Beijing, China
  • fYear
    2010
  • fDate
    6-8 April 2010
  • Firstpage
    401
  • Lastpage
    401
  • Abstract
    Web pages are typical semi-structured data. Several tree-based models have been proposed to describe the semantic content structure of web pages in order to facilitate further content analysis. However, most existing models present only the segmentation hierarchy of content blocks rather than the semantic relationships among them. In this work, we propose a novel web page semantic structure model, called the Logical Structure Model, which presents more comprehensive structural information about web pages. Based on this model, the hidden patterns in web content can be revealed more easily. The proposed model has been used to help identify course metadata in our Online Course Organization project, which aims to build an online course portal that serves course information obtained from the Web.
  • Keywords
    Web design; computer aided instruction; content management; educational courses; meta data; semantic Web; Web content; Web page semantic structure model; content analysis; content block; course information; course metadata; logical structure model; metadata extraction; online course portal; segmentation hierarchy; semantic content structure; semantic relationship; semistructure data; structure information; tree-based model; Blogs; Buildings; Computer science; Data mining; Data structures; HTML; Industrial relations; Portals; Technological innovation; Web pages; web metadata extraction; web page logical structure model;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Web Conference (APWEB), 2010 12th International Asia-Pacific
  • Conference_Location
    Busan
  • Print_ISBN
    978-0-7695-4012-2
  • Electronic_ISBN
    978-1-4244-6600-9
  • Type

    conf

  • DOI
    10.1109/APWeb.2010.81
  • Filename
    5474101