Building Bilingual Parallel Corpora Based on Wikipedia

Author

Mohammadi, Mehdi ; GhasemAghaee, Nasser

Author_Institution

Dept. of Comput. Eng., Sheikh Bahaie Univ., Isfahan, Iran

Volume

2

fYear

2010

fDate

19-21 March 2010

Firstpage

264

Lastpage

268

Abstract

Aligned parallel corpora are an important resource for a wide range of multilingual research, particularly corpus-based machine translation. In this paper we present a Persian-English sentence-aligned parallel corpus built by mining Wikipedia. We propose a method for extracting sentence-level alignments using an extended link-based bilingual lexicon method. Experimental results show that our method increases precision while reducing the total number of generated candidate pairs.

Keywords

data mining; language translation; natural language processing; search engines; Persian-English sentence-aligned parallel corpus; Wikipedia mining; bilingual parallel corpora; corpus-based machine translation; extended link-based bilingual lexicon method; sentence-level alignment extraction; Application software; Biographies; Buildings; Computer applications; Concurrent computing; Dictionaries; Encyclopedias; Natural languages; Parallel processing; Wikipedia; Parallel corpora; Sentence alignment

fLanguage

English

Publisher

IEEE

Conference_Titel

Computer Engineering and Applications (ICCEA), 2010 Second International Conference on

Conference_Location

Bali Island

Print_ISBN

978-1-4244-6079-3

Electronic_ISBN

978-1-4244-6080-9

Type

conf

DOI

10.1109/ICCEA.2010.203

Filename

5445653