Title :
Simplified-traditional Chinese character conversion based on multi-data resources: Towards a fused conversion algorithm
Author :
Hao, Tianyong ; Zhu, Chunshen
Author_Institution :
Dept. of Chinese, Translation & Linguistics, City Univ. of Hong Kong, Hong Kong, China
Abstract :
In recent years, communication between Chinese communities in different parts of the world has steadily increased. However, the traditional Chinese characters used in Taiwan, Hong Kong and Macao differ extensively in both formation and usage from the simplified Chinese characters used in mainland China and Singapore, and these differences can unexpectedly hinder verbal communication. Although researchers and industry have already produced many conversion methods, their precision is still not high enough for professional use, especially in one-to-many cases. To solve this seemingly technical but actually linguistic problem, this paper proposes a new priority-based multi-data resources management model. With this model, conversion can be more context-sensitive and human-controllable, and thus more reliable. A new algorithm called Fused Conversion Algorithm from Multi-Data resources (FCMD) is also presented. This algorithm combines the advantages of reverse maximum matching and an N-gram-based statistical model to make the system more responsive to contextual nuances. After parameter training on a large LDC corpus, the conversion precision of the proposed method reaches 90.2% on one-to-many cases, which are the most difficult part of Chinese character conversion, with an overall precision of 99.7%. Its experimental performance in terms of precision and efficiency promises a significant improvement over state-of-the-art models.
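The abstract names reverse maximum matching over prioritized data resources without giving details, so the following is only a minimal sketch of that general technique, not the authors' FCMD algorithm. The resource list `RESOURCES`, its toy entries, and the functions `lookup` and `convert_rmm` are all hypothetical names introduced for illustration, and the N-gram scoring step is omitted entirely. The sketch shows how a phrase-level resource with higher priority can resolve a one-to-many character such as 发 (which maps to either 發 or 髮) before a character-level fallback is consulted.

```python
from typing import Dict, List, Optional

# Hypothetical toy resources, ordered from highest to lowest priority.
# The entries are illustrative only and are not taken from the paper.
RESOURCES: List[Dict[str, str]] = [
    {"头发": "頭髮", "发展": "發展"},        # phrase-level resource
    {"头": "頭", "发": "發", "展": "展"},    # character-level fallback
]


def lookup(segment: str) -> Optional[str]:
    """Return the conversion from the highest-priority resource that has it."""
    for resource in RESOURCES:
        if segment in resource:
            return resource[segment]
    return None


def convert_rmm(text: str, max_len: int = 4) -> str:
    """Convert simplified text by scanning right to left, longest match first."""
    pieces: List[str] = []
    end = len(text)
    while end > 0:
        # Try the longest candidate ending at `end`, then shorter ones.
        for size in range(min(max_len, end), 0, -1):
            segment = text[end - size:end]
            target = lookup(segment)
            if target is not None or size == 1:
                # Unknown single characters are copied through unchanged.
                pieces.append(target if target is not None else segment)
                end -= size
                break
    # Pieces were collected from the end of the text, so restore the order.
    return "".join(reversed(pieces))
```

With these toy resources, `convert_rmm("头发")` picks 髮 through the phrase entry, while `convert_rmm("发")` falls back to the character-level default 發. A statistical model such as the N-gram component mentioned in the abstract would be needed wherever no phrase entry covers the ambiguous character.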
Keywords :
linguistics; natural languages; statistical analysis; Hong Kong; Macao; N-Gram-based statistical model; Singapore; Taiwan; fused conversion algorithm; priority-based multidata resources management model; simplified-traditional Chinese character conversion; verbal communications; Algorithm design and analysis; Data models; Dictionaries; Encyclopedias; Internet; Resource management; Training; Chinese character conversion; FCMD algorithm; multi-data resources; reverse maximum matching;
Conference_Title :
Next Generation Information Technology (ICNIT), 2011 The 2nd International Conference on
Conference_Location :
Gyeongju
Print_ISBN :
978-1-4577-0266-2
Electronic_ISBN :
978-89-88678-39-8