Regional Information Center for Science and Technology - Chinese Word Segmentation with Minimal Linguistic Knowledge: An Improved Conditional Random Fields Coupled with Character Clustering and Automatically Discovered Template Matching

DocumentCode :

2753940

Title :

Chinese Word Segmentation with Minimal Linguistic Knowledge: An Improved Conditional Random Fields Coupled with Character Clustering and Automatically Discovered Template Matching

Author :

Tsai, Richard Tzong-Han ; Dai, Hong-Jie ; Hung, Hsieh-Chuan ; Sung, Cheng-Lung ; Day, Min-Yuh ; Hsu, Wen-Lian

Author_Institution :

Acad. Sinica

fYear :

2006

fDate :

16-18 Sept. 2006

Firstpage :

274

Lastpage :

279

Abstract :

This paper addresses three major problems of closed-task Chinese word segmentation (CWS): word overlap, tagging sentences interspersed with non-Chinese words, and long named entity (NE) identification. For the first, we use additional bigram features to approximate trigram and tetragram features. For the second, we first apply K-means clustering to identify non-Chinese characters, then employ a two-tagger architecture: one tagger for Chinese text and the other for non-Chinese text. For the third, we post-process our CWS output using automatically generated templates. Our results show that additional bigrams can effectively identify more unknown words. Second, with our two-tagger method, segmentation performance on sentences containing non-Chinese words is significantly improved when non-Chinese characters are sparse in the training corpus. Lastly, identification of long NEs and long words is also enhanced by template-based post-processing. On the corpora of the SIGHAN CWS closed task, our best system achieves F-scores of 0.956, 0.947, and 0.965 on the AS, HK, and MSR corpora respectively, compared to the best context scores of 0.952, 0.943, and 0.964 in SIGHAN Bakeoff 2005. On AS, this performance is comparable to the best result (F = 0.956) in the open task.
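
The F-scores quoted in the abstract are word-level scores: a predicted word counts as correct only if both of its boundaries match the gold segmentation. The following is a minimal sketch of that calculation, not the official SIGHAN scoring script; the function names and the example sentence are illustrative assumptions and do not come from the paper.

```python
def word_spans(words):
    """Map a segmented sentence (list of words) to a set of (start, end) character spans."""
    spans = set()
    start = 0
    for word in words:
        end = start + len(word)
        spans.add((start, end))
        start = end
    return spans


def segmentation_scores(gold, predicted):
    """Return (precision, recall, F) for one sentence, comparing word spans.

    Hypothetical helper for illustration only; assumes both segmentations
    cover the same underlying character string.
    """
    gold_spans = word_spans(gold)
    pred_spans = word_spans(predicted)
    correct = len(gold_spans & pred_spans)
    if correct == 0:
        return 0.0, 0.0, 0.0
    precision = correct / len(pred_spans)
    recall = correct / len(gold_spans)
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Illustrative example: the tagger wrongly splits the two-character word.
gold = ["我", "喜欢", "北京"]
predicted = ["我", "喜", "欢", "北京"]
precision, recall, f_score = segmentation_scores(gold, predicted)
print(f"P={precision:.3f} R={recall:.3f} F={f_score:.3f}")
```

In this example two of the four predicted words match the three gold words, so precision is 2/4, recall is 2/3, and the F-score is their harmonic mean, 4/7.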

Keywords :

computational linguistics; natural language processing; pattern clustering; random processes; text analysis; Chinese text; Chinese word segmentation; K-means clustering; bigram feature; character clustering; conditional random fields; linguistic knowledge; named entity identification; tagging sentences; template matching; tetragram feature; trigram feature; two-tagger method; word overlap; Face; Guidelines; Machine learning; Morphology; Natural languages; Particle separators; Statistics; Tagging; Testing; Training data;

fLanguage :

English

Publisher :

IEEE

Conference_Title :

Information Reuse and Integration, 2006 IEEE International Conference on

Conference_Location :

Waikoloa Village, HI

Print_ISBN :

0-7803-9788-6

Type :

conf

DOI :

10.1109/IRI.2006.252425

Filename :

4018502

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=2753940