Which performs better for new word detection, character based or Chinese Word Segmentation based?

Author

Haijun Zhang ; Shumin Shi

Author_Institution

Sch. of Comput. Sci. & Technol., Xinjiang Normal Univ., Urumqi, China

fYear

2014

fDate

20-22 Oct. 2014

Firstpage

10

Lastpage

14

Abstract

This paper proposes a novel method for evaluating the performance of New Word Detection (NWD) based on repeats extraction. For small-scale corpora, we employ Conditional Random Fields (CRF) as the statistical framework to estimate the effects of different NWD strategies. For large-scale corpora, comparative experiments cannot be carried out because an annotated corpus of unlimited size is not available. Accordingly, this paper proposes a pragmatic quantitative model to analyze and estimate the performance of NWD in all kinds of cases, especially the large-scale corpus case. The experimental results and the conclusions drawn from the quantitative model corroborate each other. Based on the analysis of the experimental data and the quantitative model, a reliable conclusion is reached about the effects of Chinese NWD under the two strategies (character based and Chinese Word Segmentation based), which can provide guidance for follow-up studies in Chinese new word detection.

Keywords

natural language processing; random processes; statistical analysis; CRF; Chinese NWD; Chinese new word detection; Chinese word segmentation based; annotated corpus; character based; conditional random field; large-scale corpus situation; mutual authentication; performance estimation; pragmatic quantitative model; repeats extraction; statistical framework; Analytical models; Data models; Dictionaries; Educational institutions; Feature extraction; Pragmatics; Support vector machines; Chinese Word Segmentation; New Words Detection; Repeats

fLanguage

English

Publisher

IEEE

Conference_Titel

Asian Language Processing (IALP), 2014 International Conference on

Conference_Location

Kuching

Type

conf

DOI

10.1109/IALP.2014.6973474

Filename

6973474