DocumentCode :
1994164
Title :
Bilingual Segmenter for Statistical Machine Translation
Author :
Huang, Chung-Chi ; Chen, Wei-teh ; Chang, Jason S.
Author_Institution :
ISA, NTHU, Hsinchu, Taiwan
fYear :
2008
fDate :
15-16 Dec. 2008
Firstpage :
97
Lastpage :
104
Abstract :
We propose a bilingually motivated segmenting framework for Chinese, which has no clear delimiter for word boundaries. It involves producing Chinese tokens in line with word-based languages' words using a bilingual segmenting algorithm, provided with bitexts, and deriving a probabilistic tokenizing model based on previously annotated Chinese sentences. In the bilingual segmenting algorithm, we first convert the search for segmentation into a sequential tagging problem, allowing for a polynomial-time dynamic programming solution, and incorporate a control to balance mono- and bilingual information in tailoring Chinese sentences. Experiments show that our framework, applied as a pre-tokenization component, significantly outperforms existing segmenters in translation quality, suggesting our methodology supports better segmentation for bilingual NLP applications involving isolated languages such as Chinese.
Keywords :
computational complexity; computational linguistics; dynamic programming; language translation; natural language processing; probability; Chinese bilingual segmenting algorithm; Chinese sentence; Chinese token; bitext; natural language processing; polynomial-time dynamic programming solution; probabilistic tokenizing model; sequential tagging problem; statistical machine translation; word-based language; Decoding; Dynamic programming; Instruction sets; Natural language processing; Natural languages; Performance analysis; Polynomials; Probability; Tagging; White spaces; bilingual segmenter; conditional random fields; machine translation; phrase-based decoder; word alignment;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Universal Communication, 2008. ISUC '08. Second International Symposium on
Conference_Location :
Osaka
Print_ISBN :
978-0-7695-3433-6
Type :
conf
DOI :
10.1109/ISUC.2008.10
Filename :
4724447