DocumentCode :
1909620
Title :
Using Mutual Information Criterion to Design an Effective Lexicon for Chinese Pinyin-to-Character Conversion
Author :
Wei Li ; Jinsong Zhang ; Yanlu Xie ; Xiaoyun Wang ; Masanori Nishida ; Seiichi Yamamoto
Author_Institution :
Beijing Language & Culture Univ., Beijing, China
fYear :
2013
fDate :
17-19 Aug. 2013
Firstpage :
269
Lastpage :
272
Abstract :
Pinyin-to-character (P2C) conversion is used mainly to input Chinese characters into a computer. Its main difficulty is homophonic words, which are disambiguated using contextual information provided by a lexicon and an n-gram language model (LM). Our survey of state-of-the-art P2C technologies shows that conventional optimization methods are almost all based on minimizing text perplexity, which is not directly related to P2C performance. We therefore propose a new optimization criterion, the mutual information (MI) between a text corpus and its Pinyin script, and use it to perform self-supervised word segmentation, build a lexicon, and estimate an n-gram LM, from which we build a P2C system. We implemented the system using a newspaper corpus. Compared with two baseline systems, one using a handcrafted lexicon and one using a perplexity-optimized lexicon, our system achieved relative error reductions of 19.7% and 10.3%, respectively, on the test corpus. The results demonstrate the effectiveness of our proposal.
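To make the MI criterion named in the abstract more concrete, here is a minimal Python sketch that estimates the mutual information I(W; Y) between lexicon words W and their Pinyin strings Y from co-occurrence counts, together with the conditional entropy H(W | Y), which measures the homophone ambiguity left after seeing the Pinyin. This is only an illustration of the quantity itself, not the paper's procedure: the function names and the toy word list are assumptions made up for this example, and the abstract does not specify how the authors compute or optimize MI.

```python
import math
from collections import Counter


def mutual_information(pairs):
    """Empirical I(W; Y) in bits from a list of (word, pinyin) pairs."""
    n = len(pairs)
    joint = Counter(pairs)
    word_counts = Counter(w for w, _ in pairs)
    pinyin_counts = Counter(y for _, y in pairs)
    mi = 0.0
    for (w, y), c in joint.items():
        # p(w, y) * log2( p(w, y) / (p(w) * p(y)) ), written with raw counts
        mi += (c / n) * math.log2(c * n / (word_counts[w] * pinyin_counts[y]))
    return mi


def conditional_entropy(pairs):
    """Empirical H(W | Y) in bits: ambiguity about the word given its pinyin."""
    n = len(pairs)
    joint = Counter(pairs)
    pinyin_counts = Counter(y for _, y in pairs)
    return -sum(
        (c / n) * math.log2(c / pinyin_counts[y]) for (_, y), c in joint.items()
    )


# Hypothetical toy data (toneless pinyin), invented purely for illustration.
toy_pairs = [
    ("是", "shi"), ("是", "shi"), ("事", "shi"), ("市", "shi"),
    ("北京", "beijing"), ("北京", "beijing"), ("背景", "beijing"),
    ("中国", "zhongguo"),
]

if __name__ == "__main__":
    print("I(W;Y) =", mutual_information(toy_pairs))
    print("H(W|Y) =", conditional_entropy(toy_pairs))
```

Because I(W; Y) = H(W) - H(W | Y), a lexicon whose units have fewer homophones leaves less residual ambiguity H(W | Y) for the language model to resolve, which is one intuition for why an MI-based criterion could track P2C accuracy more closely than perplexity.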
Keywords :
computational linguistics; natural language processing; optimisation; Chinese character; Chinese pinyin-to-character conversion; P2C conversion; P2C technology; Pinyin script; contextual information; homophone word; lexicon; mutual information criterion; n-gram language model; newspaper corpus; self-supervised word segmentation; text perplexity; Entropy; Equations; Mathematical model; Mutual information; Optimization; Testing; Training; Language model; Mutual information; Pinyin-to-Character Conversion;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Asian Language Processing (IALP), 2013 International Conference on
Conference_Location :
Urumqi
Type :
conf
DOI :
10.1109/IALP.2013.37
Filename :
6646052