DocumentCode
172472
Title
Which performs better for new word detection, character based or Chinese Word Segmentation based?
Author
Haijun Zhang ; Shumin Shi
Author_Institution
Sch. of Comput. Sci. & Technol., Xinjiang Normal Univ., Urumqi, China
fYear
2014
fDate
20-22 Oct. 2014
Firstpage
10
Lastpage
14
Abstract
This paper proposes a novel method for evaluating the performance of New Word Detection (NWD) based on repeats extraction. For small-scale corpora, we employ Conditional Random Fields (CRF) as the statistical framework to estimate the effects of different NWD strategies. For large-scale corpora, comparative experiments cannot be carried out because no unlimited supply of annotated corpus exists. Accordingly, this paper proposes a pragmatic quantitative model to analyze and estimate NWD performance in all cases, especially for large-scale corpora. The experimental results and the conclusions drawn from the quantitative model corroborate each other. Based on the analysis of the experimental data and the quantitative model, we reach a reliable conclusion about the effects of the two strategies (character based and Chinese Word Segmentation based) on Chinese NWD, which can offer guidance for follow-up studies in Chinese new word detection.
Keywords
natural language processing; random processes; statistical analysis; CRF; Chinese NWD; Chinese new word detection; Chinese word segmentation based; annotated corpus; character based; conditional random field; large-scale corpus situation; mutual authentication; performance estimation; pragmatic quantitative model; repeats extraction; statistical framework; Analytical models; Data models; Dictionaries; Educational institutions; Feature extraction; Pragmatics; Support vector machines; Character Based; Chinese Word Segmentation; New Words Detection; Repeats
fLanguage
English
Publisher
IEEE
Conference_Titel
Asian Language Processing (IALP), 2014 International Conference on
Conference_Location
Kuching
Type
conf
DOI
10.1109/IALP.2014.6973474
Filename
6973474