DocumentCode
3280705
Title
Automatic Kurdish Sorani text categorization using N-gram based model
Author
Mohammed, F.S. ; Zakaria, L. ; Omar, N. ; Albared, M.Y.
Author_Institution
Sch. of Comput. Sci., Univ. Kebangsaan Malaysia, Bangi, Malaysia
Volume
1
fYear
2012
fDate
12-14 June 2012
Firstpage
392
Lastpage
395
Abstract
N-gram based models for text categorization have been applied to many languages, in particular those of the Indo-European family. However, few studies have applied such models to the Kurdish Sorani language. This paper presents the results of investigating an N-gram frequency statistics technique for classifying Kurdish Sorani Unicode documents from online newspapers into their classes. The technique generates frequency profiles for the training and test documents, using word-level 1-grams and character-level 2- to 8-grams as text representations. A similarity measure, the Dice measure of similarity, is then used to classify the documents. Results show that character-level 5-grams give a better text representation, which leads to better text classification.
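The approach summarized in the abstract (character n-gram frequency profiles compared with a Dice measure) might be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the top_k profile cutoff, the set-based form of the Dice coefficient, and the toy English training strings are all hypothetical, since the record does not give those details.

```python
from collections import Counter


def char_ngrams(text, n):
    """Return all overlapping character n-grams of text, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def profile(text, n=5, top_k=300):
    """Build a frequency profile: the set of the top_k most frequent n-grams.

    top_k is an assumed cutoff; the record does not state the profile size.
    """
    counts = Counter(char_ngrams(text, n))
    return {gram for gram, _ in counts.most_common(top_k)}


def dice(a, b):
    """Dice coefficient between two sets: 2|A and B| / (|A| + |B|)."""
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))


def classify(doc, class_profiles, n=5, top_k=300):
    """Assign doc to the class whose profile is most similar under Dice."""
    doc_profile = profile(doc, n, top_k)
    return max(class_profiles, key=lambda c: dice(doc_profile, class_profiles[c]))


# Toy usage with placeholder English text (real input would be Sorani Unicode text).
class_profiles = {
    "sports": profile("football match goal football", n=3),
    "economy": profile("market price bank market", n=3),
}
label = classify("football goal", class_profiles, n=3)
```

Because Python strings are sequences of Unicode code points, the same slicing would apply to Kurdish Sorani text without changes, although any normalization of the script is left out of this sketch.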
Keywords
data structures; electronic publishing; natural languages; pattern classification; pattern matching; statistical analysis; text analysis; word processing; Indo-European languages family; Kurdish Sorani Language; Kurdish Sorani unicode documents classification; automatic Kurdish Sorani text categorization; character level grams; dice measure of similarity; frequency profiles; n-gram frequency statistics technique; n-gram word level 1 gram; n-gram-based model; online newspapers; similarity algorithm; text representation; Art; Computational modeling; Computers; Information science; Text categorization; Training; Writing; Dice Measure of similarity; Indo-European languages family; Kurdish Sorani; N-Gram; Unicode; text categorization; text representation;
fLanguage
English
Publisher
IEEE
Conference_Titel
Computer & Information Science (ICCIS), 2012 International Conference on
Conference_Location
Kuala Lumpur
Print_ISBN
978-1-4673-1937-9
Type
conf
DOI
10.1109/ICCISci.2012.6297277
Filename
6297277