• DocumentCode
    2622641
  • Title

    Stemming versus Light Stemming for measuring the similarity between Arabic Words with Latent Semantic Analysis model

  • Author

    Froud, Hanane ; Lachkar, Abdelmonaime ; Ouatik, Said Alaoui

  • Author_Institution
    L.S.I.S., Univ. Sidi Mohamed Ben Abdellah (USMBA), Fez, Morocco
  • fYear
    2012
  • fDate
    22-24 Oct. 2012
  • Firstpage
    69
  • Lastpage
    73
  • Abstract
    Representation of the semantic information contained in words is needed for any Arabic Text Mining application. More precisely, the purpose is to better take into account the semantic dependencies between words expressed by the co-occurrence frequencies of these words. There have been many proposals to compute similarities between words based on their distributions in contexts. In this paper, we compare and contrast the effect of two preprocessing techniques applied to an Arabic corpus, Stemming and Light Stemming, for measuring the semantic similarity between Arabic words with the well-known abstractive model Latent Semantic Analysis (LSA), using a wide variety of distance functions and similarity measures, such as the Euclidean Distance, Cosine Similarity, Jaccard Coefficient, and the Pearson Correlation Coefficient. The obtained results show that, on the one hand, the variety of the corpus produces more accurate results; on the other hand, Light Stemming outperformed the Stemming approach because Stemming affects the words' meanings.
  • Keywords
    computational geometry; data mining; natural language processing; text analysis; Arabic corpus; Arabic documents representation; Arabic text mining; Arabic words; Euclidean distance; Jaccard coefficient; Pearson correlation coefficient; co-occurrence frequencies; cosine similarity; distance functions; latent semantic analysis model; light stemming technique; semantic dependencies; semantic information representation; similarity measurement; stemming techniques; Arabic Language; Latent Semantic Analysis (LSA); Light Stemming; Similarity Measures; Stemming;
  • fLanguage
    English
  • Publisher
    ieee
  • Conference_Titel
    Information Science and Technology (CIST), 2012 Colloquium in
  • Conference_Location
    Fez
  • Print_ISBN
    978-1-4673-2726-8
  • Electronic_ISBN
    978-1-4673-2724-4
  • Type

    conf

  • DOI
    10.1109/CIST.2012.6388065
  • Filename
    6388065