DocumentCode :
3496711
Title :
Impact of corpus size and quality on English-Bangla statistical Machine Translation system
Author :
Imam, Ali Hasan ; Arman, Miah Raihan Mahmud ; Chowdhury, Shahadat Hossain ; Mahmood, Khalid
Author_Institution :
Dept. of Comput. Sci. & Inf. Technol., American Int. Univ., Bangladesh
fYear :
2011
fDate :
22-24 Dec. 2011
Firstpage :
566
Lastpage :
571
Abstract :
Statistical machine translation (SMT) aims to translate text from a source language to a target language by applying machine learning techniques to a parallel corpus, producing a fully automatic translation system. We have developed Anubad [26], a phrase-based Bangla-to-English SMT system built on top of the SMT model proposed in [1]; it is publicly available at www.anubad.com. The most challenging task in SMT system development is building large parallel corpora, because translation quality depends significantly on corpus size and quality. Bangla parallel corpora suffer from this problem and have so far failed to yield translations of a standard quality. In this paper, through simulations, we provide a guideline for developing an English-Bangla bilingual corpus. Although more training data generally gives better results in phrase-based statistical machine translation systems, we depart from this notion: our experimental results show that a good-quality corpus can significantly improve Bangla-to-English translation quality. Using our techniques, we obtained better translation quality and effective improvements in NIST and BLEU scores.
Keywords :
language translation; natural language processing; statistical analysis; Bangla to English; English-Bangla statistical machine translation system; SMT; corpus dimension; language source; machine learning technique; parallel corpora; parallel corpus; target language; text translation; Decoding; NIST; Natural Language Processing; Parallel Corpus; Phrase-Based Machine Translation; Statistical machine translation;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Computer and Information Technology (ICCIT), 2011 14th International Conference on
Conference_Location :
Dhaka
Print_ISBN :
978-1-61284-907-2
Type :
conf
DOI :
10.1109/ICCITechn.2011.6164853
Filename :
6164853