DocumentCode :
2037526
Title :
A statistical approach for resolving problematical word boundaries in Chinese lexicography
Author :
Kwong, Oi Yee ; Tsou, Benjamin K.
Author_Institution :
Language Inf. Sci. Res. Centre, City Univ. of Hong Kong, Kowloon, China
Volume :
4
fYear :
2001
fDate :
2001
Firstpage :
2199
Abstract :
Word segmentation is an important topic in Chinese language processing. Although state-of-the-art segmentation algorithms demonstrate that more than 90% accuracy can be achieved, there remains the subtle question of what constitutes a Chinese word. In this paper, we focus on two-character word strings, which often raise doubts even for lexicographers as to whether the two characters should be segmented or kept as one word. We examine the feasibility of modelling human judgement on such problematical word boundaries by corpus-based mutual information. Preliminary results show that the strength of correlation between the two measures might be lexically as well as structurally dependent, and that mutual information only partially models human judgement on problematic Chinese word boundaries.
Keywords :
computational linguistics; Chinese lexicography; corpus-based mutual information; human judgement modelling; problematical word boundaries; statistical approach; two-character word strings; word segmentation; Art; Cities and towns; Cultural differences; Humans; Marine animals; Mutual information; Natural language processing; Natural languages; Sun; Writing
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Systems, Man, and Cybernetics, 2001 IEEE International Conference on
Conference_Location :
Tucson, AZ
ISSN :
1062-922X
Print_ISBN :
0-7803-7087-2
Type :
conf
DOI :
10.1109/ICSMC.2001.972882
Filename :
972882