Title :
Study on the Influencing Factors of Chinese Word Segmentation
Author :
Chi Xiu ; Rou Song
Author_Institution :
Coll. of Comput., Beijing Univ. of Technol., Beijing, China
Abstract :
Out-of-vocabulary (OOV) words and ambiguity are two central problems in Chinese word segmentation (CWS). Previous studies have clearly defined how to measure OOV, whereas the measurement of ambiguity still requires clarification. This paper proposes the concept of latent ambiguity (LA) and a method for calculating it, and analyzes the relationships and mutual influence among OOV, LA, and CWS. Experiments show that the real influencing factors are the OOV rate and the LA rate. Even a small-scale corpus can reflect the effectiveness of a word segmentation method with high precision. The primary task in CWS is OOV resolution. However, once OOV has been reduced far enough to reach an F1 of 0.9, ambiguity gradually increases as the training corpus or vocabulary continues to grow, so F1 stops rising and instead declines or remains unchanged. Therefore, ambiguity should not be ignored.
Keywords :
natural language processing; text analysis; vocabulary; CWS; Chinese word segmentation method; LA; OOV; calculation method; latent ambiguity; mutual influence; out-of-vocabulary words; real influencing factors; training corpus; Birds; Context; Educational institutions; Marine animals; Market research; Training; Chinese word segmentation
Conference_Title :
Asian Language Processing (IALP), 2012 International Conference on
Conference_Location :
Hanoi
Print_ISBN :
978-1-4673-6113-2
Electronic_ISBN :
978-0-7695-4886-9
DOI :
10.1109/IALP.2012.62