DocumentCode :
2760771
Title :
A discriminative approach to filter out noisy sentence pairs from bilingual corpora
Author :
Taghipour, Kaveh ; Afhami, Nasim ; Khadivi, Shahram ; Shiry, Saeed
Author_Institution :
Dept. of Comput. Eng., Amirkabir Univ. of Technol., Tehran, Iran
fYear :
2010
fDate :
4-6 Dec. 2010
Firstpage :
537
Lastpage :
541
Abstract :
Parallel corpora are essential for training statistical machine translation models. Sentence-aligned parallel corpora are usually noisy, because the automatic methods used to generate them from parallel or comparable documents are inexact, so they need to be cleaned. In this paper, new features are introduced to assess the correctness of a sentence pair, and the impact of these features in combination with state-of-the-art features from the literature is systematically evaluated. Because the features are extracted with statistical methods, the approach is language-independent. To better understand the characteristics of the problem, four supervised classification algorithms are used to classify sentence pairs as noise or parallel. Evaluation by accuracy and F-measure on the task of cleaning a noisy parallel Farsi-English corpus shows that the maximum entropy model performs better than the main filtering techniques used in this paper and yields a significant improvement over two other systems.
Keywords :
language translation; maximum entropy methods; pattern classification; statistical analysis; F-measurement; bilingual corpora; discriminative approach; maximum entropy model; noisy parallel Farsi-English corpus; noisy sentence pairs; parallel sentence-aligned corpora; sentence pair correctness; statistical machine translation models; statistical methods; supervised classification algorithms; Accuracy; Classification algorithms; Computational linguistics; Entropy; Noise; Noise measurement; Training; Corpus Filtering; Cross Language Information Retrieval; Maximum Entropy; Statistical Machine Translation;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Telecommunications (IST), 2010 5th International Symposium on
Conference_Location :
Tehran
Print_ISBN :
978-1-4244-8183-5
Electronic_ISBN :
978-1-4244-8184-2
Type :
conf
DOI :
10.1109/ISTEL.2010.5734083
Filename :
5734083