• DocumentCode
    1955326
  • Title
    A Block Segmentation Based Approach for Web Information Extraction
  • Author
    Wang, Changwei ; Sun, Chengjie ; Lin, Lei ; Wang, Xiaolong
  • Author_Institution
    Sch. of Comput. Sci. & Technol., Harbin Inst. of Technol., Harbin, China
  • fYear
    2010
  • fDate
    28-30 Dec. 2010
  • Firstpage
    154
  • Lastpage
    157
  • Abstract
    This paper addresses web information extraction in support of automatic teacher information management. We propose an effective approach based on block segmentation. First, teacher introduction web pages are divided into independent blocks, using HTML tags and punctuation marks as segmentation criteria. Then a conditional random field (CRF) model is employed to label the text. We apply this approach to a teacher web page dataset collected from heterogeneous sources. Experimental results indicate that, for basic information and contact information extraction, our approach achieves accurate results using only word-level features. When value features related to education are extended to the block level, the performance of our system on the more complex educational information extraction task improves dramatically.
  • Keywords
    Internet; educational administrative data processing; information retrieval; CRF model; HTML tag; Web information extraction; automatic teacher information management; block segmentation; punctuation mark; Data mining; Educational institutions; Feature extraction; HTML; Hidden Markov models; Tagging; Web pages; CRF; information extraction
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Asian Language Processing (IALP), 2010 International Conference on
  • Conference_Location
    Harbin
  • Print_ISBN
    978-1-4244-9063-9
  • Type
    conf
  • DOI
    10.1109/IALP.2010.23
  • Filename
    5681602