• DocumentCode
    2044743
  • Title

    Building Bilingual Parallel Corpora Based on Wikipedia

  • Author

    Mohammadi, Mehdi ; GhasemAghaee, Nasser

  • Author_Institution
    Dept. of Comput. Eng., Sheikh Bahaie Univ., Isfahan, Iran
  • Volume
    2
  • fYear
    2010
  • fDate
    19-21 March 2010
  • Firstpage
    264
  • Lastpage
    268
  • Abstract
    Aligned parallel corpora are an important resource for a wide range of multilingual research, in particular corpus-based machine translation. In this paper we present a Persian-English sentence-aligned parallel corpus built by mining Wikipedia. We propose a method for extracting sentence-level alignments using an extended link-based bilingual lexicon method. Experimental results show that our method increases precision while reducing the total number of generated candidate pairs.
  • Keywords
    data mining; language translation; natural language processing; search engines; Persian-English sentence-aligned parallel corpus; Wikipedia mining; bilingual parallel corpora; corpus-based machine translation; extended link-based bilingual lexicon method; sentence-level alignment extraction; Application software; Biographies; Buildings; Computer applications; Concurrent computing; Dictionaries; Encyclopedias; Natural languages; Parallel processing; Wikipedia; Parallel corpora; Sentence alignment; Wikipedia;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Computer Engineering and Applications (ICCEA), 2010 Second International Conference on
  • Conference_Location
    Bali Island
  • Print_ISBN
    978-1-4244-6079-3
  • Electronic_ISBN
    978-1-4244-6080-9
  • Type

    conf

  • DOI
    10.1109/ICCEA.2010.203
  • Filename
    5445653