DocumentCode :
3580814
Title :
Creating Indonesian-Javanese parallel corpora using wikipedia articles
Author :
Trisedya, Bayu Distiawan ; Inastra, Dyah
Author_Institution :
Fac. of Comput. Sci., Univ. Indonesia, Depok, Indonesia
fYear :
2014
Firstpage :
239
Lastpage :
245
Abstract :
Parallel corpora are necessary for multilingual research, especially in information retrieval (IR) and natural language processing (NLP). However, such corpora are hard to find, particularly for low-resource languages such as ethnic languages. Parallel corpora of ethnic languages have usually been collected manually. Wikipedia, a free online encyclopedia, supports more languages each year, including ethnic languages of Indonesia, and has become one of the largest multilingual sites on the World Wide Web that provides freely distributed articles. In this paper, we explore several sentence alignment methods that have previously been used in other domains. We examine whether Wikipedia can serve as a resource for collecting parallel corpora of Indonesian and Javanese, an ethnic language of Indonesia. We used two approaches to sentence alignment, treating Wikipedia as both a parallel corpus and a comparable corpus. In the parallel corpus case, we used sentence-length-based and word correspondence methods. In the comparable corpus case, we used the characteristics of Wikipedia's hypertext links. The experiments show that Wikipedia is sufficiently useful for this purpose, as both approaches gave positive results.
Keywords :
Web sites; encyclopaedias; public domain software; text analysis; IR; Indonesian-Javanese parallel corpora; NLP; Wikipedia articles; World Wide Web; comparable corpora; ethnic languages; free distributed articles; free online encyclopedia; hypertext link characteristics; information retrieval; low-resource languages; multilingual researches; multilingual sites; natural language processing; sentence alignment methods; sentence length based method; word correspondence method; Abstracts; Computer science; Decision support systems; Electronic publishing; Encyclopedias; Handheld computers;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Advanced Computer Science and Information Systems (ICACSIS), 2014 International Conference on
Type :
conf
DOI :
10.1109/ICACSIS.2014.7065828
Filename :
7065828