• DocumentCode
    2541554
  • Title

    Region based data extraction

  • Author

    Goh, Pui Leng ; Hong, Jer Lang ; Tan, Ee Xion ; Goh, Wei Wei

  • Author_Institution
    Sch. of Comput. & IT, Taylor's Univ., Subang Jaya, Malaysia
  • fYear
    2012
  • fDate
    29-31 May 2012
  • Firstpage
    1196
  • Lastpage
    1200
  • Abstract
    Wrappers are tools used to extract relevant information from HTML pages. Current approaches use DOM trees, visual cues, and ontologies to extract data. DOM tree based techniques are fast and simple. However, they are less accurate than visual based wrappers because they lack the additional information needed to perform data extraction. Visual based wrappers, on the other hand, are slow because of the extra processing needed to obtain visual cues from the underlying browser rendering engine. Ontology based wrappers are accurate, but they are also slow and need manual tuning to operate. In this paper, we propose a novel visual based wrapper to extract information from HTML pages. Our wrapper uses visual cues to eliminate unnecessary regions, which reduces the running time of the extraction task because the wrapper only needs to consider the relevant region. Our wrapper then removes irrelevant data from the relevant region using visual cues. Experimental results show that our wrapper outperforms the state-of-the-art wrapper WISH in data extraction.
  • Keywords
    data mining; hypermedia markup languages; ontologies (artificial intelligence); search engines; tree data structures; DOM tree; HTML pages; browser rendering engine; extraction task; ontology based wrapper; region based data extraction; visual based wrapper; visual cue; Data mining; Engines; HTML; Ontologies; Search engines; Visualization; Web sites; Automatic Wrapper; Deep Web; Search Engines;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Fuzzy Systems and Knowledge Discovery (FSKD), 2012 9th International Conference on
  • Conference_Location
    Sichuan
  • Print_ISBN
    978-1-4673-0025-4
  • Type

    conf

  • DOI
    10.1109/FSKD.2012.6233750
  • Filename
    6233750