  • DocumentCode
    3238826
  • Title
    An improved root extraction technique for Arabic words
  • Author
    Al-Nashashibi, May Y. ; Neagu, D. ; Yaghi, Ali A.
  • Author_Institution
    Dept. of Comput., Univ. of Bradford, Bradford, UK
  • fYear
    2010
  • fDate
    2-4 Nov. 2010
  • Firstpage
    264
  • Lastpage
    269
  • Abstract
    Interpreting Arabic text depends, among other things, on a pre-processing stage that effectively extracts the correct stem or root. In this work we address a linguistic approach to root extraction as a pre-processing step for Arabic text mining. The linguistic approach combines a rule-based light stemmer with a pattern-based infix remover. Because this approach does not handle weak, eliminated-long-vowel, hamzated, and geminated words, and a reasonably large portion of Arabic words in texts are irregular, we propose an algorithm to handle these cases. The accuracy of the extracted roots is determined by comparing them with a predefined list of 5,405 triliteral and quadriliteral roots. The performance of the linguistic approach, with and without the proposed correction algorithm, was tested on an in-house text collection of eight categories. The proposed correction algorithm improved the accuracy of the linguistic approach by about 14%.
  • Keywords
    data mining; natural language processing; text analysis; Arabic text interpretation; Arabic text mining; Arabic words; improved root extraction technique; linguistic approach; pattern-based infix remover; rule-based light stemmer; Pragmatics; Weaving; Arabic Root Extraction; Natural Language Processing; Rule-Based Stemming; Text Mining;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Computer Technology and Development (ICCTD), 2010 2nd International Conference on
  • Conference_Location
    Cairo
  • Print_ISBN
    978-1-4244-8844-5
  • Electronic_ISBN
    978-1-4244-8845-2
  • Type
    conf
  • DOI
    10.1109/ICCTD.2010.5645872
  • Filename
    5645872