• DocumentCode
    2303963
  • Title

    Experiments on the use of corpus-based word BI-gram in Chinese word segmentation

  • Author

    Xu, Ruifeng ; Yeung, Daniel

  • Author_Institution
    Dept. of Comput., Hong Kong Polytech., Kowloon, Hong Kong
  • Volume
    5
  • fYear
    1998
  • fDate
    11-14 Oct 1998
  • Firstpage
    4222
  • Abstract
    The first step in Chinese language processing is to segment a sentence into a sequence of words, because written Chinese has no explicit separators between adjacent words. An efficient corpus-based statistical method is adopted here to address this problem. In this paper, word BI-gram statistical measures derived from a corpus are employed to resolve segmentation ambiguities. To segment a Chinese sentence, a bidirectional maximum matching method is first used for pre-matching, in order to obtain segmentation candidates and locate possible ambiguities. Statistical measures based on word BI-gram information and word frequency are then used to construct a discriminate function, which is applied to the ambiguity strings in order to select the most probable segmentation. Experimental results are analyzed to describe the features and limitations of this approach, and preliminary results indicate that it compares favorably with other existing techniques.
  • Keywords
    character recognition; image segmentation; natural languages; statistical analysis; Chinese sentence segmentation; Chinese word segmentation; ambiguity strings; bidirectional maximum matching method; corpus-based statistical method; corpus-based word BI-gram; discriminate function; segmentation ambiguities; word BI-gram statistical measures; word frequency; Dictionaries; Frequency measurement; Natural language processing; Natural languages; Particle separators; Probability; Statistical analysis;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Systems, Man, and Cybernetics, 1998. 1998 IEEE International Conference on
  • Conference_Location
    San Diego, CA
  • ISSN
    1062-922X
  • Print_ISBN
    0-7803-4778-1
  • Type

    conf

  • DOI
    10.1109/ICSMC.1998.727508
  • Filename
    727508
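
The two-stage procedure summarized in the abstract (bidirectional maximum matching for pre-matching, followed by a bigram-based choice between candidates) might be sketched roughly as below. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not give the form of the discriminate function, so an add-one-smoothed bigram log-probability stands in for it, the score is computed over the whole candidate rather than only the ambiguity string, and word frequency enters only through the smoothing denominator. The lexicon, the counts, and all function names are hypothetical.

```python
import math

# Hypothetical toy lexicon and corpus counts, invented for illustration only.
LEXICON = {"研究", "研究生", "生命", "命", "起源"}
UNIGRAM = {"研究": 50, "研究生": 20, "生命": 40, "命": 5, "起源": 30}
BIGRAM = {("研究", "生命"): 8, ("生命", "起源"): 12}


def forward_max_match(text, lexicon, max_len):
    """Scan left to right, always taking the longest lexicon word."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is accepted as a fallback word.
            if size == 1 or piece in lexicon:
                words.append(piece)
                i += size
                break
    return words


def backward_max_match(text, lexicon, max_len):
    """Scan right to left, always taking the longest lexicon word."""
    words, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in lexicon:
                words.append(piece)
                j -= size
                break
    return words[::-1]


def bigram_score(words, unigram, bigram):
    """Assumed stand-in score: sum of add-one-smoothed bigram log-probabilities."""
    vocab = len(unigram)
    score = 0.0
    for prev, cur in zip(words, words[1:]):
        pair_count = bigram.get((prev, cur), 0)
        score += math.log((pair_count + 1) / (unigram.get(prev, 0) + vocab))
    return score


def segment(text, lexicon, unigram, bigram, max_len=3):
    """Return the shared result, or the higher-scoring candidate on disagreement."""
    fwd = forward_max_match(text, lexicon, max_len)
    bwd = backward_max_match(text, lexicon, max_len)
    if fwd == bwd:
        return fwd  # both directions agree, so no ambiguity was located
    return max((fwd, bwd), key=lambda cand: bigram_score(cand, unigram, bigram))
```

With these toy counts, the forward and backward passes disagree on the string 研究生命起源 (研究生 / 命 / 起源 versus 研究 / 生命 / 起源), and the assumed score prefers the backward candidate.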