• DocumentCode
    2489224
  • Title

    Detecting Informative Web Page Blocks for Efficient Information Extraction Using Visual Block Segmentation

  • Author

    Kang, Jinbeom ; Choi, Joongmin

  • Author_Institution
    Hanyang Univ., Ansan
  • fYear
    2007
  • fDate
    23-24 Nov. 2007
  • Firstpage
    306
  • Lastpage
    310
  • Abstract
    As the structure of Web pages grows more complicated, constructing wrapper induction rules becomes more difficult and time-consuming. The main problem in most wrapper induction methods is the difficulty of discriminating the meaningful blocks that contain the target information from the noise blocks that contain irrelevant information such as advertisements, menus, or copyright statements. To solve this problem, this paper proposes the RIPB (recognizing informative page blocks) algorithm, which detects the informative blocks in a Web page by exploiting the visual block segmentation scheme. RIPB uses the visual page segmentation algorithm to analyze and partition a Web page into a set of logical blocks, groups related blocks with similar structures into block clusters, and then recognizes the informative block clusters by applying heuristic rules to the cluster information. The results of a series of experiments indicate that RIPB helps improve the accuracy of information extraction by allowing the wrapper induction module to focus only on the informative blocks and ignore noise when building extraction rules.
  • Keywords
    Web sites; information retrieval; learning (artificial intelligence); Web page; cluster information; heuristic rules; information extraction; machine learning; recognizing informative page blocks; visual block segmentation; wrapper induction; Clustering algorithms; Computer science; Data mining; Information analysis; Information technology; Partitioning algorithms; Supervised learning; Target recognition; Training data; Web pages;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Information Technology Convergence, 2007. ISITC 2007. International Symposium on
  • Conference_Location
    Jeonju
  • Print_ISBN
    0-7695-3045-1
  • Electronic_ISBN
    978-0-7695-3045-1
  • Type

    conf

  • DOI
    10.1109/ISITC.2007.6
  • Filename
    4410655
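
The pipeline outlined in the abstract (segment the page into blocks, cluster blocks with similar structure, then apply heuristic rules per cluster) can be illustrated with a minimal sketch. This is not the authors' implementation. The abstract does not state the segmenter's output format, the similarity measure, or the heuristic rules, so the `Block` fields, the exact-match clustering on `tag_path`, and the text and link thresholds below are all assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Block:
    """A logical page block, as a visual segmenter might emit it (hypothetical shape)."""
    tag_path: str     # structural signature, e.g. "div/ul/li" (assumed)
    text_chars: int   # characters of visible text in the block
    link_chars: int   # characters of that text that sit inside links


def cluster_blocks(blocks):
    """Group blocks that share the same structural signature.

    Exact matching on tag_path stands in for the paper's notion of
    "similar structures", which the abstract does not define.
    """
    clusters = defaultdict(list)
    for block in blocks:
        clusters[block.tag_path].append(block)
    return dict(clusters)


def is_informative(cluster, min_chars=100, max_link_ratio=0.5):
    """Assumed heuristic: enough text, and not dominated by link text."""
    text = sum(b.text_chars for b in cluster)
    links = sum(b.link_chars for b in cluster)
    # Multiplying instead of dividing avoids a zero division on empty clusters.
    return text >= min_chars and links <= max_link_ratio * text


def informative_clusters(blocks):
    """Return only the clusters that the assumed heuristic accepts."""
    return {
        signature: cluster
        for signature, cluster in cluster_blocks(blocks).items()
        if is_informative(cluster)
    }
```

Under these assumptions, a cluster of short, link-only blocks (a navigation menu, for example) is dropped, while a cluster of text-heavy blocks is kept and would be the only input passed on to the wrapper induction step.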