DocumentCode
3489424
Title
Study on the Influencing Factors of Chinese Word Segmentation
Author
Chi Xiu ; Rou Song
Author_Institution
Coll. of Comput., Beijing Univ. of Technol., Beijing, China
fYear
2012
fDate
13-15 Nov. 2012
Firstpage
29
Lastpage
32
Abstract
Out-of-vocabulary (OOV) words and ambiguity are two important issues in Chinese word segmentation (CWS). In previous studies, the measurement of OOV has been clearly defined, while the measurement of ambiguity requires further clarification. This paper puts forward the concept and calculation method of latent ambiguity (LA) and analyzes the relationship and mutual influence among OOV, LA, and CWS. Experiments show that the real influencing factors are the rates of OOV and LA. Even a small-scale corpus can reflect the effectiveness of a word segmentation method with high precision. The primary task in CWS is OOV resolution. However, once OOV has been reduced enough to achieve an F1 of 0.9, ambiguity gradually increases as the training corpus or vocabulary continues to grow, and as a result F1 stops improving or even decreases. Therefore, ambiguity should not be ignored.
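The two quantities the abstract treats as already well defined, the OOV rate and the segmentation F1, can be illustrated with a minimal sketch. The function names (`spans`, `oov_rate`, `segmentation_f1`) and the toy sentence are assumptions made for illustration and are not taken from the paper; the sketch uses the common convention of scoring a predicted word as correct only when both of its boundaries match the gold segmentation. The latent ambiguity (LA) measure is not specified in the abstract, so it is not attempted here.

```python
def spans(words):
    """Turn a segmented sentence into a set of (start, end) character spans."""
    out, pos = set(), 0
    for w in words:
        out.add((pos, pos + len(w)))
        pos += len(w)
    return out


def oov_rate(test_sentences, vocabulary):
    """Fraction of gold test tokens that are absent from the vocabulary."""
    tokens = [w for sent in test_sentences for w in sent]
    if not tokens:
        return 0.0
    return sum(w not in vocabulary for w in tokens) / len(tokens)


def segmentation_f1(gold_sentences, pred_sentences):
    """Word-level F1: a predicted word counts only if its span matches gold."""
    correct = n_gold = n_pred = 0
    for gold, pred in zip(gold_sentences, pred_sentences):
        g, p = spans(gold), spans(pred)
        correct += len(g & p)
        n_gold += len(g)
        n_pred += len(p)
    if correct == 0:
        return 0.0
    precision = correct / n_pred
    recall = correct / n_gold
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy data: one gold sentence and one system output.
gold = [["北京", "大学", "生"]]
pred = [["北京大学", "生"]]
vocab = {"北京", "大学"}

print(oov_rate(gold, vocab))        # one of three gold tokens is OOV
print(segmentation_f1(gold, pred))  # only the final word's span matches
```

In this toy case the system output shares one span with the gold segmentation, giving a precision of 1/2 and a recall of 1/3.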
Keywords
natural language processing; text analysis; vocabulary; CWS; Chinese word segmentation method; LA; OOV; calculation method; latent ambiguity; mutual influence; out-of-vocabulary words; real influencing factors; training corpus; vocabulary; Birds; Context; Educational institutions; Marine animals; Market research; Training; Vocabulary; Chinese word segmentation; Latent ambiguity; Out-of-vocabulary words;
fLanguage
English
Publisher
IEEE
Conference_Titel
Asian Language Processing (IALP), 2012 International Conference on
Conference_Location
Hanoi
Print_ISBN
978-1-4673-6113-2
Electronic_ISBN
978-0-7695-4886-9
Type
conf
DOI
10.1109/IALP.2012.62
Filename
6473688