• DocumentCode
    3280705
  • Title

    Automatic Kurdish Sorani text categorization using N-gram based model

  • Author

    Mohammed, F.S. ; Zakaria, L. ; Omar, N. ; Albared, M.Y.

  • Author_Institution
    Sch. of Comput. Sci., Univ. Kebangsaan Malaysia, Bangi, Malaysia
  • Volume
    1
  • fYear
    2012
  • fDate
    12-14 June 2012
  • Firstpage
    392
  • Lastpage
    395
  • Abstract
    N-gram-based models for text categorization have been applied to many languages, in particular those of the Indo-European family. However, few studies have applied this model to the Kurdish Sorani language. This paper presents the results of investigating an N-gram frequency statistics technique for classifying Kurdish Sorani Unicode documents from online newspapers into their classes. The technique generates frequency profiles for the training and test documents, using word-level 1-grams and character-level 2- to 8-grams as text representations. A similarity algorithm, the “Dice measure of similarity”, is then used to classify the documents. Results show that character-level 5-grams give a better text representation, which leads to better text classification.
  • Keywords
    data structures; electronic publishing; natural languages; pattern classification; pattern matching; statistical analysis; text analysis; word processing; Indo-European languages family; Kurdish Sorani Language; Kurdish Sorani unicode documents classification; automatic Kurdish Sorani text categorization; character level grams; dice measure of similarity; frequency profiles; n-gram frequency statistics technique; n-gram word level 1 gram; n-gram-based model; online newspapers; similarity algorithm; text representation; Art; Computational modeling; Computers; Information science; Text categorization; Training; Writing; Dice Measure of similarity; Indo-European languages family; Kurdish Sorani; N-Gram; Unicode; text categorization; text representation;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Computer & Information Science (ICCIS), 2012 International Conference on
  • Conference_Location
    Kuala Lumpur
  • Print_ISBN
    978-1-4673-1937-9
  • Type

    conf

  • DOI
    10.1109/ICCISci.2012.6297277
  • Filename
    6297277
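
The approach summarized in the abstract (character n-gram frequency profiles compared with a Dice similarity measure) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the record does not say how profiles are built or which form of the Dice measure is used, so the sketch assumes a profile is the set of the `top_k` most frequent n-grams and uses the set-based Dice coefficient 2|A∩B| / (|A| + |B|). The names `char_ngrams`, `profile`, `dice`, `classify`, and the `top_k` parameter are hypothetical.

```python
from collections import Counter


def char_ngrams(text, n):
    """Return all overlapping character n-grams of `text`."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def profile(texts, n=5, top_k=300):
    """Build a frequency profile from a list of texts.

    Assumption: the profile is the set of the `top_k` most frequent
    character n-grams; the paper may define its profiles differently.
    """
    counts = Counter()
    for t in texts:
        counts.update(char_ngrams(t, n))
    return {gram for gram, _ in counts.most_common(top_k)}


def dice(a, b):
    """Set-based Dice coefficient: 2|A ∩ B| / (|A| + |B|)."""
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))


def classify(doc, class_profiles, n=5, top_k=300):
    """Assign `doc` to the class whose profile is most similar to its own."""
    doc_profile = profile([doc], n, top_k)
    return max(class_profiles,
               key=lambda label: dice(doc_profile, class_profiles[label]))
```

In use, one profile would be built per newspaper category from its training documents and passed to `classify` as a dictionary keyed by category label; `n=5` mirrors the character-level setting the abstract reports as giving the best representation.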