Improving Arabic document categorization: Introducing local stem

Author

Al-Shammari, Eiman Tamah

Author_Institution

Kuwait Univ., Safat, Kuwait

fYear

2010

fDate

Nov. 29 2010-Dec. 1 2010

Firstpage

385

Lastpage

390

Abstract

Stemming is a fundamental step in processing textual data, preceding the tasks of text mining, information retrieval (IR), and natural language processing (NLP). The common goal of stemming is to standardize words by reducing each word to its base form (root or stem), so it can also be considered a feature reduction technique. This paper presents a new dictionary-free, content-based Arabic stemmer and adopts it as a feature reduction (selection) mechanism to study its contribution to improving Arabic text categorization. We employed three stemming mechanisms (root-based, light, and our stemming technique) and assessed their performance in text classification exercises on an Arabic corpus, in order to compare and contrast the text mining effectiveness of these Arabic stemming algorithms. The experiments were conducted on a corpus of 2,966 Arabic documents that fall into three categories: cultural, social, and general. The experimental results showed that our stemmer significantly improved text classification accuracy.
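
The idea of stemming as feature reduction can be illustrated with a minimal sketch. It assumes a toy light stemmer with a small, hand-picked affix list; the names `PREFIXES`, `SUFFIXES`, `light_stem`, and `vocabulary` are hypothetical, and this is not the content-based stemmer proposed in the paper, whose details the abstract does not give. The sketch only shows how mapping surface forms to a shared stem shrinks the feature vocabulary a classifier has to handle.

```python
# Toy light stemmer: strips at most one common Arabic prefix and one suffix.
# The affix lists are a small assumed sample chosen for illustration only;
# they are NOT the stemmer described in the paper.
PREFIXES = ["وال", "بال", "كال", "فال", "ال", "لل", "و"]  # longer prefixes first
SUFFIXES = ["ها", "ان", "ات", "ون", "ين", "ية", "ه", "ة", "ي"]


def light_stem(word: str, min_len: int = 3) -> str:
    """Remove one prefix and one suffix, keeping at least min_len characters."""
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= min_len:
            word = word[len(prefix):]
            break
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_len:
            word = word[:-len(suffix)]
            break
    return word


def vocabulary(tokens, stem=None):
    """Return the set of distinct features, optionally after stemming."""
    return {stem(t) if stem else t for t in tokens}


# Six surface forms built around two base words ("book" and "library").
tokens = ["الكتاب", "كتاب", "والكتاب", "كتابان", "المكتبة", "مكتبة"]

raw_vocab = vocabulary(tokens)
stemmed_vocab = vocabulary(tokens, light_stem)

print(len(raw_vocab), "features before stemming")
print(len(stemmed_vocab), "features after stemming")
```

In this toy input the six distinct tokens collapse to two stems. A real comparison such as the one the abstract describes would swap in each stemmer in turn and measure classification accuracy on the resulting features.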

Keywords

data mining; pattern classification; text analysis; Arabic document categorization; Arabic stemming algorithms; Arabic text categorization; content-based Arabic stemmer; dictionary free Arabic stemmer; feature reduction technique; light stemming mechanism; root-based stemming mechanism; text classification; text mining; textual data processing; Classification; Stemming; Text Mining

fLanguage

English

Publisher

IEEE

Conference_Titel

Intelligent Systems Design and Applications (ISDA), 2010 10th International Conference on

Conference_Location

Cairo

Print_ISBN

978-1-4244-8134-7

Type

conf

DOI

10.1109/ISDA.2010.5687235

Filename

5687235