DocumentCode
3582773
Title
Compression-based Arabic text classification
Author
Ta'amneh, Haneen ; Abu Keshek, Ehsan ; Issa, Manar Bani ; Al-Ayyoub, Mahmoud ; Jararweh, Yaser
Author_Institution
Jordan Univ. of Sci. & Technol., Irbid, Jordan
fYear
2014
Firstpage
594
Lastpage
600
Abstract
Text classification (TC) is one of the fundamental problems in text mining. Plenty of work exists on TC, with interesting approaches and excellent results; however, most of it follows a word-based approach to feature extraction. In this work, we are interested in an alternative (byte-based or character-based) approach known as compression-based TC (CTC). CTC has been used for some languages, such as English and Portuguese, and has been shown to have certain advantages and disadvantages compared with word-based approaches. This work applies CTC to the Arabic language to investigate whether these advantages and disadvantages exist for Arabic as well. The results are encouraging, as they show the viability of using CTC for Arabic TC.
Keywords
classification; data mining; feature extraction; natural language processing; text analysis; Arabic language; CTC; English; Portuguese; compression-based Arabic text classification; compression-based TC; text mining; word-based approach; Accuracy; Compression algorithms; Dictionaries; Natural language processing; Niobium; Testing; Training
fLanguage
English
Publisher
IEEE
Conference_Titel
Computer Systems and Applications (AICCSA), 2014 IEEE/ACS 11th International Conference on
Type
conf
DOI
10.1109/AICCSA.2014.7073253
Filename
7073253