• DocumentCode
    1580663
  • Title

    Web information extraction based on IEBIDTech

  • Author

    Ren, Xiaoyan ; Fu, Yunxia

  • Author_Institution
    China Three Gorges University, Yichang, 443002, China
  • fYear
    2012
  • Firstpage
    239
  • Lastpage
    241
  • Abstract
    Based on a survey of contemporary web information extraction theory, this paper studies a frequently discussed but insufficiently solved problem: extracting data from web pages that contain several structured records. It proposes a new approach called IEBIDTech (Information Extraction Based on Improved DOM Tree), which consists of three main steps. In step 1, the given page is transformed into a DOM tree and segmented into several blocks according to HTML delimiters, and the redundant blocks are then removed. Step 2 induces the extraction rules, and step 3 extracts the structured data. Extensive experiments on web pages from diverse domains show that both recall and precision exceed 90%, indicating that the approach extracts data accurately.
  • Keywords
    DOM tree; Induction algorithm; Web information extraction; Web page denoising; Web page segmentation;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    World Automation Congress (WAC), 2012
  • Conference_Location
    Puerto Vallarta, Mexico
  • ISSN
    2154-4824
  • Print_ISBN
    978-1-4673-4497-5
  • Type

    conf

  • Filename
    6321297
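
The abstract's step 1 (segmenting a page into blocks by HTML delimiters, then removing redundant blocks) can be illustrated with a minimal sketch. This is not the authors' implementation: the set of delimiter tags, the link-density heuristic for spotting redundant blocks, the 0.5 threshold, and every name below (`BlockSegmenter`, `segment`, `link_density`, `remove_redundant`) are assumptions made for illustration only. It uses just Python's standard `html.parser` module.

```python
from html.parser import HTMLParser

# Assumption: these tags act as the "HTML delimiters" that open a new block.
BLOCK_TAGS = {"div", "table", "ul", "ol", "section"}


class BlockSegmenter(HTMLParser):
    """Collect plain text and link text for each top-level delimiter block."""

    def __init__(self):
        super().__init__()
        self.blocks = []     # finished blocks
        self.depth = 0       # nesting depth inside delimiter tags
        self.in_link = 0     # nesting depth inside <a> tags
        self.current = None  # block currently being filled, if any

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            if self.depth == 0:
                self.current = {"text": [], "link_text": []}
            self.depth += 1
        elif tag == "a":
            self.in_link += 1

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.blocks.append(self.current)
                self.current = None
        elif tag == "a" and self.in_link > 0:
            self.in_link -= 1

    def handle_data(self, data):
        text = data.strip()
        if self.current is None or not text:
            return
        self.current["text"].append(text)
        if self.in_link:
            self.current["link_text"].append(text)


def segment(html):
    """Split a page into top-level blocks (hypothetical stand-in for step 1)."""
    parser = BlockSegmenter()
    parser.feed(html)
    parser.close()
    return parser.blocks


def link_density(block):
    """Fraction of a block's characters that sit inside links."""
    total = sum(len(t) for t in block["text"])
    linked = sum(len(t) for t in block["link_text"])
    return linked / total if total else 1.0


def remove_redundant(blocks, max_link_density=0.5):
    """Drop link-heavy blocks such as navigation bars (assumed heuristic)."""
    return [b for b in blocks if link_density(b) <= max_link_density]
```

As a usage example, calling `remove_redundant(segment(page))` on a page with a link-only navigation `div` followed by a `ul` of product records would keep only the record list; rule induction and extraction (steps 2 and 3) would then operate on the surviving blocks.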