Title :
New Language Resources for Arabic: Corpus Containing More Than Two Million Words and a Corpus Processing Tool
Author :
Al-Thubaity, Abdulmohsen ; Khan, Mahrukh ; Al-Mazrua, Manal ; Al-Mousa, Maram
Author_Institution :
Comput. Res. Inst., King Abdulaziz City for Sci. & Technol., Riyadh, Saudi Arabia
Abstract :
Arabic is a resource-poor language relative to other languages with a similar number of speakers. This situation negatively affects corpus-based linguistic studies in Arabic and, to a lesser extent, Arabic language processing. This paper presents a brief overview of recent freely available Arabic corpora and corpora processing tools, and it examines some of the issues that may be preventing Arabic linguists from using them. These issues reveal the need for new language resources to enrich and foster Arabic corpus-based studies. Accordingly, this paper introduces the design of a new Arabic corpus that includes modern standard Arabic varieties based on newspapers from all Arab countries and that comprises more than two million words. It also describes the main features of a corpus processing tool specifically designed for Arabic, called "Khawas غواص" ("diver" in English). Khawas provides more features than any other freely available corpus processing tool for Arabic, including n-gram frequency and concordance, collocations, and statistical comparison of two corpora. Finally, we outline modifications and improvements that could be made in future work.
Keywords :
linguistics; natural language processing; publishing; statistical analysis; text analysis; Arab countries; Arabic corpora processing tools; Arabic corpus-based studies; Arabic language processing; Arabic linguists; Khawas; collocations; concordance; corpus-based linguistic studies; language resources; n-gram frequency; newspapers; resource-poor language; standard Arabic varieties; statistical comparison; Availability; Communities; Educational institutions; Internet; Pragmatics; Text categorization; Writing; Arabic concordance; Arabic corpora; N-grams; collocation; corpora comparison
Conference_Title :
Asian Language Processing (IALP), 2013 International Conference on
Conference_Location :
Urumqi
DOI :
10.1109/IALP.2013.21