Title :
Automatic Kurdish Sorani text categorization using N-gram based model
Author :
Mohammed, F.S. ; Zakaria, L. ; Omar, N. ; Albared, M.Y.
Author_Institution :
Sch. of Comput. Sci., Univ. Kebangsaan Malaysia, Bangi, Malaysia
Abstract :
The N-gram based model for text categorization has been applied to many languages, in particular those of the Indo-European language family. However, few studies have applied this model to the Kurdish Sorani language. This paper presents the results of investigating the N-gram frequency statistics technique for classifying Kurdish Sorani Unicode documents from online newspapers into their classes. The technique generates frequency profiles for the training and test documents, using word-level 1-grams and character-level 2-, 3-, 4-, 5-, 6-, 7-, and 8-grams as text representations. A similarity measure, the "Dice measure of similarity", is then used to classify the documents. Results show that character-level 5-grams give a better text representation, which leads to better text classification.
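The pipeline described in the abstract (character n-gram frequency profiles compared with a Dice measure) could be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the `top_k` profile size, the whitespace normalization, and the set-based form of the Dice coefficient are all assumptions, since the abstract does not specify them.

```python
from collections import Counter


def char_ngrams(text, n):
    """Return all overlapping character n-grams of a whitespace-normalized text."""
    normalized = " ".join(text.split())  # assumed preprocessing step
    return [normalized[i:i + n] for i in range(len(normalized) - n + 1)]


def profile(text, n=5, top_k=300):
    """Build a frequency profile: the set of the top_k most frequent n-grams.

    top_k=300 is a hypothetical default, not a value taken from the paper.
    """
    counts = Counter(char_ngrams(text, n))
    return {gram for gram, _ in counts.most_common(top_k)}


def dice(a, b):
    """Set-based Dice coefficient: 2|A ∩ B| / (|A| + |B|)."""
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))


def classify(document, class_profiles, n=5, top_k=300):
    """Assign the document to the class whose profile is most similar."""
    doc_profile = profile(document, n, top_k)
    return max(class_profiles,
               key=lambda label: dice(doc_profile, class_profiles[label]))
```

With one profile per class built from training text, `classify(document, class_profiles)` returns the label with the highest Dice score. Varying `n` from 2 to 8 would reproduce the kind of comparison the abstract reports, under the assumptions stated above.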
Keywords :
data structures; electronic publishing; natural languages; pattern classification; pattern matching; statistical analysis; text analysis; word processing; Indo-European languages family; Kurdish Sorani Language; Kurdish Sorani unicode documents classification; automatic Kurdish Sorani text categorization; character level grams; dice measure of similarity; frequency profiles; n-gram frequency statistics technique; n-gram word level 1 gram; n-gram-based model; online newspapers; similarity algorithm; text representation; Art; Computational modeling; Computers; Information science; Text categorization; Training; Writing; Kurdish Sorani; N-Gram; Unicode
Conference_Title :
Computer & Information Science (ICCIS), 2012 International Conference on
Conference_Location :
Kuala Lumpur
Print_ISBN :
978-1-4673-1937-9
DOI :
10.1109/ICCISci.2012.6297277