• DocumentCode
    2910589
  • Title

    Mining Parallel Data from Comparable Corpora via Triangulation

  • Author

    Do, Thi-Ngoc-Diep ; Castelli, Eric ; Besacier, Laurent

  • Author_Institution
    MICA Center, Grenoble INP, Hanoi, Vietnam
  • fYear
    2011
  • fDate
    15-17 Nov. 2011
  • Firstpage
    185
  • Lastpage
    188
  • Abstract
    This paper improves an unsupervised method for extracting parallel sentence pairs from a comparable corpus by using triangulation through a third language. The unsupervised method was proposed previously; it is based on a cross-language information retrieval technique with an iterative process and requires no additional parallel data. The method has been validated on Vietnamese-French and Vietnamese-English bilingual data. In this paper, we address the problem of using triangulation through a third language to improve the parallel data mining process: English is used when mining Vietnamese-French parallel data, and French is used when mining Vietnamese-English parallel data. The experiments show that triangulation can improve both the quality of the extracted data and the quality of the translation system.
  • Keywords
    data mining; information retrieval; iterative methods; language translation; natural language processing; Vietnamese-English bilingual data; Vietnamese-French bilingual data; comparable corpora; cross-language information retrieval; iterative process; machine translation; parallel data mining; parallel sentence pair extraction; translation system quality; triangulation; unsupervised method; Computational linguistics; Data mining; Information filters; Noise measurement; Training; comparable corpus; extracting parallel sentence pairs; triangulation method
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Asian Language Processing (IALP), 2011 International Conference on
  • Conference_Location
    Penang
  • Print_ISBN
    978-1-4577-1733-8
  • Type

    conf

  • DOI
    10.1109/IALP.2011.57
  • Filename
    6121499