Title :
SALMA: Standard Arabic Language Morphological Analysis
Author :
Sawalha, M. ; Atwell, Eric ; Abushariah, Mohammad A. M.
Author_Institution :
Comput. Inf. Syst. Dept., Univ. of Jordan, Amman, Jordan
Abstract :
Morphological analyzers are preprocessors for text analysis, and many text analytics applications need them to perform their tasks. This paper reviews the SALMA-Tools (Standard Arabic Language Morphological Analysis) [1]. The SALMA-Tools is a collection of open-source standards, tools and resources that widen the scope of Arabic word structure analysis, particularly morphological analysis, to process Arabic text corpora of different domains, formats and genres, in both vowelized and non-vowelized text. Tag assignment is significantly more complex for Arabic than for many other languages. The morphological analyzer should add the appropriate linguistic information to each part or morpheme of the word (proclitic, prefix, stem, suffix and enclitic); in effect, instead of one tag for a word, we need a subtag for each part. Very fine-grained distinctions may cause problems for automatic morphosyntactic analysis, particularly for probabilistic taggers, which require training data, if some words can change grammatical tag depending on function and context. On the other hand, fine-grained distinctions may actually help to disambiguate other words in the local context, and more fine-grained tag sets may be more appropriate for some tasks. The SALMA-Tagger is a fine-grained morphological analyzer that depends mainly on linguistic information extracted from traditional Arabic grammar books and on a prior-knowledge broad-coverage lexical resource, the SALMA-ABCLexicon. The SALMA-Tag Set is a standard tag set for encoding, which captures long-established traditional fine-grained morphological features of Arabic in a notation format intended to be compact yet transparent.
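The idea of one subtag per word part, rather than one tag per word, can be illustrated with a minimal Python sketch. It is not the SALMA-Tagger's implementation and does not use the SALMA-Tag Set notation: the `Morpheme` class, the helper functions and the plain-English subtag labels are all hypothetical, chosen only to show the shape of the data.

```python
from dataclasses import dataclass
from typing import List

# The five word parts named in the abstract, in their surface order.
PART_TYPES = ("proclitic", "prefix", "stem", "suffix", "enclitic")


@dataclass(frozen=True)
class Morpheme:
    """One segment of a word with its own subtag (hypothetical structure)."""
    part: str      # one of PART_TYPES
    surface: str   # the segment's characters
    subtag: str    # illustrative label, NOT SALMA-Tag Set notation

    def __post_init__(self) -> None:
        if self.part not in PART_TYPES:
            raise ValueError(f"unknown word part: {self.part!r}")


def surface_form(morphemes: List[Morpheme]) -> str:
    """Concatenate the segments back into the whole word."""
    return "".join(m.surface for m in morphemes)


def in_canonical_order(morphemes: List[Morpheme]) -> bool:
    """Check that parts appear as proclitic(s), prefix, stem, suffix, enclitic(s)."""
    ranks = [PART_TYPES.index(m.part) for m in morphemes]
    return ranks == sorted(ranks)


# Illustrative analysis of one word, roughly "and they will write it".
# A word may carry more than one proclitic, so the analysis is a list.
ANALYSIS = [
    Morpheme("proclitic", "و", "conjunction"),
    Morpheme("proclitic", "س", "future particle"),
    Morpheme("prefix", "ي", "imperfect verb prefix"),
    Morpheme("stem", "كتب", "verb stem"),
    Morpheme("suffix", "ون", "masculine plural subject suffix"),
    Morpheme("enclitic", "ها", "object pronoun"),
]

if __name__ == "__main__":
    print(surface_form(ANALYSIS))
    for m in ANALYSIS:
        print(f"{m.part:<10} {m.surface:<4} {m.subtag}")
```

Storing the analysis as an ordered list of segments, rather than as five fixed fields, leaves room for words that have several proclitics or that lack some parts entirely.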
Keywords :
information retrieval; linguistics; natural language processing; public domain software; text analysis; ABCLexicon; Arabic text corpora; Arabic word structure analysis; SALMA-TagSet; SALMA-Tagger; SALMA-tools; automatic morphosyntactic analysis; fine grained morphological analyzer; fine-grained tag sets; grammatical tag; linguistic information; linguistic information extraction; nonvowelized text; open source standards; prior-knowledge broad-coverage lexical resources; probabilistic taggers; standard Arabic language morphological analysis; tag-assignment; text analysis; text analytics applications; traditional Arabic grammar books; vowelized text; word morpheme; Accuracy; Algorithm design and analysis; Educational institutions; Gold; Morphology; Standards; Fine-grain; Morphological analysis; Tag Set; Traditional Arabic Grammar; Traditional Arabic Lexicons;
Conference_Title :
Communications, Signal Processing, and their Applications (ICCSPA), 2013 1st International Conference on
Conference_Location :
Sharjah
Print_ISBN :
978-1-4673-2820-3
DOI :
10.1109/ICCSPA.2013.6487311