DocumentCode
2910589
Title
Mining Parallel Data from Comparable Corpora via Triangulation
Author
Do, Thi-Ngoc-Diep ; Castelli, Eric ; Besacier, Laurent
Author_Institution
MICA Center, Grenoble INP, Hanoi, Vietnam
fYear
2011
fDate
15-17 Nov. 2011
Firstpage
185
Lastpage
188
Abstract
This paper improves an unsupervised method for extracting parallel sentence pairs from a comparable corpus by using triangulation through a third language. The unsupervised method was proposed previously; it is based on cross-language information retrieval with an iterative process and requires no additional parallel data. It has been validated on Vietnamese-French and Vietnamese-English bilingual data. In this paper, we use triangulation through a third language to improve the parallel data mining processes: English is used in the Vietnamese-French parallel data mining process, and French is used in the Vietnamese-English parallel data mining process. The experiments show that triangulation can improve both the quality of the extracted data and the quality of the translation system.
Keywords
data mining; information retrieval; iterative methods; language translation; natural language processing; Vietnamese-English bilingual data; Vietnamese-French bilingual data; comparable corpora; cross-language information retrieval; iterative process; machine translation; parallel data mining; parallel sentence pair extraction; translation system quality; triangulation; unsupervised method; Computational linguistics; Data mining; Information filters; Noise measurement; Training; comparable corpus; extracting parallel sentence pairs; triangulation method
fLanguage
English
Publisher
IEEE
Conference_Titel
Asian Language Processing (IALP), 2011 International Conference on
Conference_Location
Penang
Print_ISBN
978-1-4577-1733-8
Type
conf
DOI
10.1109/IALP.2011.57
Filename
6121499