Title :
Which performs better for new word detection, character based or Chinese Word Segmentation based?
Author :
Haijun Zhang ; Shumin Shi
Author_Institution :
Sch. of Comput. Sci. & Technol., Xinjiang Normal Univ., Urumqi, China
Abstract :
This paper proposes a novel method for evaluating the performance of New Word Detection (NWD) based on repeats extraction. For small-scale corpora, we employ Conditional Random Fields (CRF) as the statistical framework to estimate the effects of different NWD strategies. For large-scale corpora, the supply of annotated data is limited, so comparative experiments cannot be used for evaluation. Accordingly, this paper proposes a pragmatic quantitative model to analyze and estimate NWD performance in all cases, especially for large-scale corpora. The experimental results and the conclusions drawn from the quantitative model corroborate each other well. Based on analysis of the experimental data and the quantitative model, we reach a reliable conclusion about the effects of the two strategies (character based and Chinese Word Segmentation based) on Chinese NWD, which can offer guidance for follow-up studies in Chinese new word detection.
Keywords :
natural language processing; random processes; statistical analysis; CRF; Chinese NWD; Chinese new word detection; Chinese word segmentation based; annotated corpus; character based; conditional random field; large-scale corpus situation; mutual authentication; performance estimation; pragmatic quantitative model; repeats extraction; statistical framework; Analytical models; Data models; Dictionaries; Educational institutions; Feature extraction; Pragmatics; Support vector machines; CRF; Character Based; Chinese Word Segmentation; New Words Detection; Repeats;
Conference_Title :
Asian Language Processing (IALP), 2014 International Conference on
Conference_Location :
Kuching
DOI :
10.1109/IALP.2014.6973474