• DocumentCode
    379129
  • Title
    Reverse engineering for Web data: from visual to semantic structures
  • Author
    Chung, Christina Yip ; Gertz, Michael ; Sundaresan, Neel
  • Author_Institution
    Verity Inc., Sunnyvale, CA, USA
  • fYear
    2002
  • fDate
    2002
  • Firstpage
    53
  • Lastpage
    63
  • Abstract
    Despite the advancement of XML, the majority of documents on the Web are still marked up with HTML for visual rendering purposes only, thus building a huge amount of legacy data. To facilitate querying Web-based data in a way more efficient and effective than keyword-based retrieval alone, enriching such Web documents with both structure and semantics is necessary. We describe a novel approach to the integration of topic-specific HTML documents into a repository of XML documents. In particular, we describe how topic-specific HTML documents are transformed into XML documents. The proposed document transformation and semantic element tagging process utilizes document restructuring rules and minimal information about the topic in the form of concepts. For the resulting XML documents, a majority schema is derived that describes common structures among the documents in the form of a DTD. We explore and discuss different techniques and rules for document conversion and majority schema discovery. Finally, we demonstrate the feasibility and effectiveness of our approach by applying it to a set of resume HTML documents gathered by a Web crawler.
  • Keywords
    Internet; hypermedia markup languages; information resources; query processing; reverse engineering; HTML; Web crawler; World Wide Web data; XML; document conversion; document restructuring rules; document transformation; keyword based retrieval; majority schema; querying; semantic element tagging; semantic structures; semantics; topic specific documents; visual structures; Computer science; Crawlers; Database languages; Information retrieval; Semantic Web; Writing
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Data Engineering, 2002. Proceedings. 18th International Conference on
  • Conference_Location
    San Jose, CA
  • ISSN
    1063-6382
  • Print_ISBN
    0-7695-1531-2
  • Type
    conf
  • DOI
    10.1109/ICDE.2002.994697
  • Filename
    994697