• DocumentCode
    1531336
  • Title
    Repetition-based web page segmentation by detecting tag patterns for small-screen devices
  • Author
    Kang, Jinbeom ; Yang, Jaeyoung ; Choi, Joongmin
  • Author_Institution
    Dept. of Comput. Sci. & Eng., Hanyang Univ., Ansan, South Korea
  • Volume
    56
  • Issue
    2
  • fYear
    2010
  • fDate
    5/1/2010 12:00:00 AM
  • Firstpage
    980
  • Lastpage
    986
  • Abstract
    Web page segmentation into logical blocks is an important preprocessing step for recognizing informative content blocks in a page, which leads to efficient information extraction and convenient display on devices with small-sized screens. Previous methods for Web page segmentation are not flexible in a dynamic Web environment because they rely largely on heuristic rules generated by exploiting structural tags and visual information inherent in a page. To resolve this problem, this paper proposes a new method of Web page segmentation that recognizes repetitive tag patterns, called key patterns, in the DOM tree structure of a page. We report on the Repetition-based Page Segmentation (REPS) algorithm, which detects key patterns in a page and generates virtual nodes to correctly segment nested blocks. A series of experiments performed on real Web sites showed that REPS greatly contributes to improving the correctness of Web page segmentation.
  • Keywords
    Computer displays; Computer science; Data mining; HTML; Large screen displays; Mobile communication; Mobile computing; Pattern recognition; Tree data structures; Web pages; Web Page Segmentation; REPS; Key Patterns; Information Extraction
  • fLanguage
    English
  • Journal_Title
    Consumer Electronics, IEEE Transactions on
  • Publisher
    IEEE
  • ISSN
    0098-3063
  • Type
    jour
  • DOI
    10.1109/TCE.2010.5506029
  • Filename
    5506029