Title :
Building Bilingual Parallel Corpora Based on Wikipedia
Author :
Mohammadi, Mehdi ; GhasemAghaee, Nasser
Author_Institution :
Dept. of Comput. Eng., Sheikh Bahaie Univ., Isfahan, Iran
Abstract :
Aligned parallel corpora are an important resource for a wide range of multilingual research, particularly corpus-based machine translation. In this paper we present a Persian-English sentence-aligned parallel corpus built by mining Wikipedia. We propose a method for extracting sentence-level alignments based on an extended link-based bilingual lexicon approach. Experimental results show that our method increases precision while reducing the total number of generated candidate pairs.
Keywords :
data mining; language translation; natural language processing; search engines; Persian-English sentence-aligned parallel corpus; Wikipedia mining; bilingual parallel corpora; corpus-based machine translation; extended link-based bilingual lexicon method; sentence-level alignment extraction; Application software; Biographies; Buildings; Computer applications; Concurrent computing; Dictionaries; Encyclopedias; Natural languages; Parallel processing; Wikipedia; Parallel corpora; Sentence alignment
Conference_Title :
Computer Engineering and Applications (ICCEA), 2010 Second International Conference on
Conference_Location :
Bali Island
Print_ISBN :
978-1-4244-6079-3
Electronic_ISBN :
978-1-4244-6080-9
DOI :
10.1109/ICCEA.2010.203