Title :
A Maximum-Entropy Segmentation Model for Statistical Machine Translation
Author :
Deyi Xiong ; Min Zhang ; Haizhou Li
Author_Institution :
Dept. of Human Language Technol., Inst. for Infocomm Res., Singapore, Singapore
Abstract :
Segmentation is of great importance to statistical machine translation. It splits a source sentence into sequences of translatable segments. We propose a maximum-entropy segmentation model to capture desirable phrasal and hierarchical segmentations for statistical machine translation. We present an approach that automatically learns the beginning and ending boundaries of cohesive segments from word-aligned bilingual data without using any additional resources. The learned boundaries are then used to define cohesive segments in both phrasal and hierarchical segmentations. We integrate the segmentation model into phrasal statistical machine translation (SMT) and conduct experiments on the newswire and broadcast news domains to investigate the effectiveness of the proposed segmentation model on large-scale training data. Our experimental results show that the maximum-entropy segmentation model significantly improves translation quality in terms of BLEU. We further validate that 1) the proposed segmentation model significantly outperforms the syntactic constraints used in previous work to constrain segmentations; and 2) it is necessary to capture hierarchical segmentations in addition to phrasal segmentations.
Keywords :
computational linguistics; language translation; maximum entropy methods; statistical analysis; SMT; cohesive segments; desirable phrasal segmentations; hierarchical segmentations; large-scale training data; learned boundary; maximum-entropy segmentation model; phrasal statistical machine translation; source sentence; syntactic constraints; translatable segments; translation quality; word-aligned bilingual data; Decoding; Entropy; Feature extraction; Syntactics; Training; Training data; Bracketing transduction grammar (BTG)-based phrasal machine translation; hierarchical segmentation; maximum entropy; phrasal segmentation; statistical machine translation (SMT)
Journal_Title :
Audio, Speech, and Language Processing, IEEE Transactions on
Conference_Location :
4/21/2011 12:00:00 AM
DOI :
10.1109/TASL.2011.2144971