• DocumentCode
    3489424
  • Title

    Study on the Influencing Factors of Chinese Word Segmentation

  • Author

    Chi Xiu ; Rou Song

  • Author_Institution
    Coll. of Comput., Beijing Univ. of Technol., Beijing, China
  • fYear
    2012
  • fDate
    13-15 Nov. 2012
  • Firstpage
    29
  • Lastpage
    32
  • Abstract
    Out-of-vocabulary (OOV) words and ambiguity are two important issues in Chinese word segmentation (CWS). In previous studies, the measurement of OOV has been clearly defined, while the measurement of ambiguity requires further clarification. This paper puts forward the concept and calculation method of latent ambiguity (LA) and analyzes the relations and mutual influence among OOV, LA and CWS. Experiments show that the real influencing factors are the rates of OOV and LA. Even a small-scale language corpus can reflect the effectiveness of a word segmentation method with high precision. The primary task in CWS is OOV resolution, but once OOV has been reduced enough to achieve an F1 of 0.9, ambiguity gradually increases as the training corpus or vocabulary continues to grow, and as a result F1 stops improving or even decreases. Therefore, ambiguity should not be ignored.
  • Keywords
    natural language processing; text analysis; vocabulary; CWS; Chinese word segmentation method; LA; OOV; calculation method; latent ambiguity; mutual influence; out-of-vocabulary words; real influencing factors; training corpus; Birds; Context; Educational institutions; Marine animals; Market research; Training; Vocabulary; Chinese word segmentation; Latent ambiguity; Out-of-vocabulary words
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Asian Language Processing (IALP), 2012 International Conference on
  • Conference_Location
    Hanoi
  • Print_ISBN
    978-1-4673-6113-2
  • Electronic_ISBN
    978-0-7695-4886-9
  • Type

    conf

  • DOI
    10.1109/IALP.2012.62
  • Filename
    6473688