Title :
Impact of corpus size and quality on English-Bangla statistical Machine Translation system
Author :
Imam, Ali Hasan ; Arman, Miah Raihan Mahmud ; Chowdhury, Shahadat Hossain ; Mahmood, Khalid
Author_Institution :
Dept. of Comput. Sci. & Inf. Technol., American Int. Univ., Bangladesh
Abstract :
Statistical machine translation (SMT) aims to translate text from a source language to a target language by applying machine learning techniques to a parallel corpus, producing a fully automatic translation system. We have developed Anubad [26], a phrase-based Bangla-to-English SMT system built on top of the SMT model proposed in [1]; it is publicly available at www.anubad.com. The most challenging task in SMT system development is building large parallel corpora, because translation quality depends significantly on corpus size and quality. Bangla parallel corpora suffer from this problem and have so far failed to deliver a standard translation. In this paper, through simulations, we provide a guideline for developing an English-Bangla bilingual corpus. Although more training data generally yields better results in phrase-based SMT systems, we depart from this notion: our experimental results show that a good-quality corpus can significantly improve Bangla-to-English translation quality. Employing our techniques, we obtained better translation quality and achieved effective improvements in NIST and BLEU scores.
Keywords :
language translation; natural language processing; statistical analysis; Bangla to English; English-Bangla statistical machine translation system; SMT; corpus dimension; language source; machine learning technique; parallel corpora; parallel corpus; target language; text translation; Decoding; NIST; Natural Language Processing; Parallel Corpus; Phrase-Based Machine Translation; Statistical machine translation;
Conference_Title :
Computer and Information Technology (ICCIT), 2011 14th International Conference on
Conference_Location :
Dhaka
Print_ISBN :
978-1-61284-907-2
DOI :
10.1109/ICCITechn.2011.6164853