• DocumentCode
    498575
  • Title

    A Novel Method of Chinese Web Information Extraction and Applications

  • Author

    Liu, Zhong ; Wang, Ying

  • Author_Institution
    Chengdu Inst. of Comput. Applic., Chinese Acad. of Sci., Chengdu, China
  • Volume
    1
  • fYear
    2009
  • fDate
    10-11 July 2009
  • Firstpage
    65
  • Lastpage
    68
  • Abstract
    One promising application of natural language processing (NLP) research is information extraction (IE). In this paper, we present the workflow of our IE system for extracting semantically rich information from unstructured or semi-structured Chinese web pages. A knowledge engineering approach and an automatic training approach are used to extract patterns and build a knowledge repository. A general IE system requires the training web pages to be labeled. We develop a novel methodology that does not require labeled text; it includes hierarchical filtration pattern matching based on syntax using a best-distance method, and maximum forward boundary recognition using an organization suffix repository and part-of-speech tagging. As an application of IE, we build a new system: an object-level vertical search system in which the objects are Chinese people, so IE is concerned with extracting people's attributes from a collection of web pages about Chinese people. The results are displayed as a hierarchical directory tree organized by people's attributes. The system lets users find people quickly and easily.
  • Keywords
    Internet; knowledge engineering; natural language processing; Chinese web information extraction; Web pages; automatic training approach; distance method; filtration pattern matching; knowledge engineering approach; maximum forward boundary recognition; natural language processing research; object-level vertical search system; organization suffix repository; speech tagging method; Data mining; Filtration; Knowledge engineering; Natural language processing; Pattern matching; Pattern recognition; Speech recognition; Tagging; Text recognition; Web pages; information extraction (IE); machine learning (ML); natural language processing (NLP)
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Information Engineering, 2009. ICIE '09. WASE International Conference on
  • Conference_Location
    Taiyuan, Shanxi
  • Print_ISBN
    978-0-7695-3679-8
  • Type

    conf

  • DOI
    10.1109/ICIE.2009.43
  • Filename
    5211147