Title :
Extracting parallel phrases from comparable corpora
Author :
Jiexin Zhang ; Hailong Cao ; Tiejun Zhao
Author_Institution :
Sch. of Computer Sci. & Technol., Harbin Inst. of Technol., Harbin, China
Abstract :
State-of-the-art statistical machine translation (SMT) models are trained on parallel corpora. However, traditional SMT loses its power for language pairs with few bilingual resources. This paper proposes a novel method that treats phrase extraction as a classification task. We first automatically generate the training and testing phrase pairs for the classifier. Then, we train an SVM classifier that determines whether a phrase pair is parallel or non-parallel. The proposed approach is evaluated on a Chinese-English translation task. Experimental results show that the precision of the classifier on the test sets is above 70% and the accuracy is above 98%. The quality of the extracted data is also evaluated by measuring its impact on the performance of a state-of-the-art SMT system built with a small parallel corpus. The results improve over the baseline system.
Keywords :
language translation; natural language processing; pattern classification; performance evaluation; support vector machines; Chinese-English translation; SMT; SVM classifier; bilingual resources; classification task; comparable corpora; language pairs; parallel corpora; parallel phrases; phrase extraction; statistical machine translation model; testing phrase pair; training phrase pair; translation task; Computational linguistics; Data mining; Feature extraction; Testing; Training; Training data; Statistical Machine Translation; Support Vector Machine; classification; comparable corpus
Conference_Title :
Asian Language Processing (IALP), 2014 International Conference on
Conference_Location :
Kuching
DOI :
10.1109/IALP.2014.6973501