• DocumentCode
    3582773
  • Title

    Compression-based Arabic text classification

  • Author

    Ta'amneh, Haneen ; Abu Keshek, Ehsan ; Issa, Manar Bani ; Al-Ayyoub, Mahmoud ; Jararweh, Yaser

  • Author_Institution
    Jordan Univ. of Sci. & Technol., Irbid, Jordan
  • fYear
    2014
  • Firstpage
    594
  • Lastpage
    600
  • Abstract
    Text classification (TC) is one of the fundamental problems in text mining. Plenty of works exist on TC with interesting approaches and excellent results; however, most of these works follow a word-based approach for feature extraction. In this work, we are interested in an alternative (byte-based or character-based) approach known as compression-based TC (CTC). CTC has been used for some languages such as English and Portuguese, and it has been shown to have certain advantages/disadvantages compared with word-based approaches. This work applies CTC to the Arabic language with the purpose of investigating whether these advantages/disadvantages exist for the Arabic language as well. The results are encouraging as they show the viability of using CTC for Arabic TC.
  • Keywords
    classification; data mining; feature extraction; natural language processing; text analysis; Arabic language; CTC; English; Portuguese; compression-based Arabic text classification; compression-based TC; feature extraction; text mining; word-based approach; Accuracy; Compression algorithms; Dictionaries; Natural language processing; Niobium; Testing; Training;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Computer Systems and Applications (AICCSA), 2014 IEEE/ACS 11th International Conference on
  • Type

    conf

  • DOI
    10.1109/AICCSA.2014.7073253
  • Filename
    7073253