Improving Chinese Chunking with Enriched Statistical and Morphological Knowledge

Author

Yao, Limin ; Li, Mu ; Huang, Changning

Author_Institution

Tsinghua Univ., Beijing

fYear

2007

fDate

Aug. 30 2007-Sept. 1 2007

Firstpage

149

Lastpage

156

Abstract

In this paper, we address the problem of improving a Chinese chunking system with rich lexicalized information. To tackle the data sparseness problem that arises from a limited amount of labeled training data, we propose a method that incorporates two kinds of knowledge into a state-of-the-art CRF-based chunking model: statistical information based on distributional similarity between words, obtained from a large unlabeled corpus, and morphological knowledge. Evaluations are performed on the latest release of the Chinese Treebank, and experimental results show that our method outperforms chunking models based on features over words and automatically assigned POS tags when using the same amount of training data.

Keywords

natural language processing; random processes; statistical analysis; Chinese Treebank; Chinese chunking; conditional random field model; data sparseness problem; morphological knowledge; statistical knowledge; Asia; Chromium; Data mining; Lead; Natural languages; Tagging; Training data; Tree data structures

fLanguage

English

Publisher

IEEE

Conference_Title

Natural Language Processing and Knowledge Engineering, 2007. NLP-KE 2007. International Conference on

Conference_Location

Beijing

Print_ISBN

978-1-4244-1611-0

Electronic_ISBN

978-1-4244-1611-0

Type

conf

DOI

10.1109/NLPKE.2007.4368026

Filename

4368026