DocumentCode :
657579
Title :
Developing an Arabic corpus for event mining
Author :
Alasfour, Abdel Alnasser A. ; Trausan-Matu, Stefan
Author_Institution :
Comput. Sci. Dept., Politeh. Univ. of Bucharest, Bucharest, Romania
fYear :
2013
fDate :
11-13 Oct. 2013
Firstpage :
21
Lastpage :
28
Abstract :
Recently, Arabic Natural Language Processing (A-NLP) has begun to attract more interest. Corpora in general have become a dependable resource for language engineering, including information retrieval, machine translation, and other natural-language-related disciplines. As a result, many Arabic corpora have been developed, and most of them are available online for linguistics researchers. Examples include the Agence France-Presse (AFP) corpus, an Arabic newswire corpus developed by the Linguistic Data Consortium (LDC) [1,8], and the Quranic Arabic Corpus organized by the University of Leeds [5]. Any objective research in NLP requires a corpus covering most of the language patterns across a variety of domains [21]. Over the years, however, new jargon has appeared across the Arabic-speaking states. In this paper, Modern Standard Arabic is used to avoid any region-specific Arabic language patterns [1]. The Organization of Islamic Cooperation (OIC) is selected as the main data source. The OIC is the second-largest inter-governmental organization after the United Nations, comprising 57 member states on four continents. Some data is also taken from the International Islamic News Agency (IINA). IINA is the information arm of the OIC, operating as an electronic newspaper with electronic categorization of news documents. In the future, this corpus will be part of a parallel (Arabic-English) corpus. For that reason, we have selected sites that provide parallel documents in both Arabic and English.
Keywords :
data mining; linguistics; natural language processing; A-NLP; AFP corpus; Agence France-Press; Arabic corpora; Arabic language patterns; Arabic natural language processing; Arabic newswire; Arabic speaking states; Arabic-English corpus; English language; IINA; International Islamic News Agency; LDC; Linguistic Data Consortium; OIC; Organization of Islamic Cooperation; Quranic Arabic corpus; United Nations; University of Leeds; electronic categorization; electronic newspaper; event mining; inter-governmental organization; jargons; language engineering; modern standard Arabic; news documents; parallel corpus; parallel multilingual document; Educational institutions; HTML; Internet; Natural language processing; Pragmatics; Standards organizations; Web pages; A-NLP; Corpus; Event Mining; Extraction;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
System Theory, Control and Computing (ICSTCC), 2013 17th International Conference
Conference_Location :
Sinaia
Print_ISBN :
978-1-4799-2227-7
Type :
conf
DOI :
10.1109/ICSTCC.2013.6688930
Filename :
6688930