An improved root extraction technique for Arabic words

Author

Al-Nashashibi, May Y. ; Neagu, D. ; Yaghi, Ali A.

Author_Institution

Dept. of Comput., Univ. of Bradford, Bradford, UK

fYear

2010

fDate

2-4 Nov. 2010

Firstpage

264

Lastpage

269

Abstract

Interpreting Arabic text depends, among other things, on a pre-processing stage that effectively extracts the correct stem or root of each word. This work presents a linguistic approach to root extraction as a pre-processing step for Arabic text mining. The linguistic approach combines a rule-based light stemmer with a pattern-based infix remover. Because this approach does not handle weak, eliminated-long-vowel, hamzated, and geminated words, and a reasonably large portion of the Arabic words in texts are irregular, we propose a correction algorithm for these cases. The accuracy of the extracted roots is determined by comparing them with a predefined list of 5,405 triliteral and quadriliteral roots. The performance of the linguistic approach, with and without the proposed correction algorithm, was tested on an in-house text collection covering eight categories. The proposed correction algorithm improved the accuracy of the linguistic approach by about 14%.
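
To illustrate the kind of pipeline the abstract describes (a rule-based light stemmer followed by a pattern-based infix remover, with extracted roots checked against a root list), here is a minimal Python sketch. It is not the authors' implementation: the affix lists, the patterns, the three-entry root list, and all function names are hypothetical and chosen only for illustration. The last example word is a weak verb, which this simple approach fails on; that is the type of irregular case the proposed correction algorithm is said to target.

```python
# Illustrative sketch only. The affixes, patterns, and root list below are
# small hypothetical samples, not the rules or the 5,405-root list of the paper.

PREFIXES = ["وال", "بال", "كال", "فال", "ال", "لل", "و"]
SUFFIXES = ["ات", "ون", "ين", "ان", "ها", "ية", "ة", "ه", "ي"]

# Morphological patterns: the letters ف, ع, ل mark root-letter slots;
# every other letter is an affix or infix that must match literally.
PATTERNS = ["فاعل", "مفعول", "مفعل", "فعال", "فعيل", "فعول", "تفعيل", "فعل"]
ROOT_SLOTS = "فعل"


def light_stem(word):
    """Strip at most one prefix and one suffix, keeping at least 3 letters."""
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= 3:
            word = word[len(prefix):]
            break
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[:-len(suffix)]
            break
    return word


def match_pattern(stem, pattern):
    """Return the root letters if the stem fits the pattern, else None."""
    if len(stem) != len(pattern):
        return None
    root = []
    for stem_char, pattern_char in zip(stem, pattern):
        if pattern_char in ROOT_SLOTS:
            root.append(stem_char)      # slot position: keep the root letter
        elif stem_char != pattern_char:
            return None                 # literal position does not match
    return "".join(root)


def extract_root(word, roots):
    """Light-stem the word, then return the first candidate found in roots."""
    stem = light_stem(word)
    for pattern in PATTERNS:
        candidate = match_pattern(stem, pattern)
        if candidate is not None and candidate in roots:
            return candidate
    return None


def accuracy(words, roots):
    """Share of words whose extracted root appears in the root list."""
    hits = sum(1 for word in words if extract_root(word, roots) is not None)
    return hits / len(words)


if __name__ == "__main__":
    sample_roots = {"كتب", "درس", "قول"}
    sample_words = ["الكاتب", "مكتوب", "المدرسة", "قال"]
    for word in sample_words:
        print(word, "->", extract_root(word, sample_roots))
    print("accuracy:", accuracy(sample_words, sample_roots))
```

In this sketch the first three words resolve to a root in the sample list, while the weak verb قال (root قول) does not, because its middle root letter surfaces as a long vowel that no literal pattern here can recover.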

Keywords

data mining; natural language processing; text analysis; Arabic text interpretation; Arabic text mining; Arabic words; improved root extraction technique; linguistic approach; pattern-based infix remover; rule-based light stemmer; Pragmatics; Weaving; Arabic Root Extraction; Natural Language Processing; Rule-Based Stemming; Text Mining

fLanguage

English

Publisher

IEEE

Conference_Titel

Computer Technology and Development (ICCTD), 2010 2nd International Conference on

Conference_Location

Cairo

Print_ISBN

978-1-4244-8844-5

Electronic_ISBN

978-1-4244-8845-2

Type

conf

DOI

10.1109/ICCTD.2010.5645872

Filename

5645872