• DocumentCode
    2931316
  • Title

    Learning Rules to Pre-process Web Data for Automatic Integration

  • Author

    Simon, Kai ; Hornung, Thomas ; Lausen, Georg

  • Author_Institution
    Inst. für Informatik, Univ. Freiburg
  • fYear
    2006
  • fDate
    Nov. 2006
  • Firstpage
    107
  • Lastpage
    116
  • Abstract
    Web pages such as product catalogues and search engine result pages often follow a global layout template that helps users retrieve information. In this paper we present a technique that makes such content machine-processable by extracting it and transforming it into tabular form. We achieve this with ViPER, our fully automatic wrapper system, which localizes and extracts structured data records from such Web pages using a strategy based on the visual perception of a Web page. The first contribution of this paper is a detailed account of ViPER's post-processing heuristics, which are materialized as a set of rules. Once these rules are defined, the regular content of a Web page can be abstracted into a relational view. Second, we show that new, unseen content rendered with the same layout only has to be extracted by ViPER, whereas the remaining transformation can be performed by applying the learned rules.
  • Keywords
    Internet; information retrieval; learning (artificial intelligence); ViPER; Web data preprocessing; Web pages; Web sites; automatic integration; fully automatic wrapper system; learning rules; product catalogues; search engine; structured data record extraction; structured data record localization; Data mining; Databases; HTML; Humans; Information resources; Search engines; Visual perception
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Rules and Rule Markup Languages for the Semantic Web, Second International Conference on
  • Conference_Location
    Athens, GA
  • Print_ISBN
    0-7695-2652-7
  • Type

    conf

  • DOI
    10.1109/RULEML.2006.16
  • Filename
    4032397