DocumentCode :
3767539
Title :
Effect of word segmentation on Arabic text classification
Author :
Abdulmohsen Al-Thubaity;Abdullah Al-Subaie
Author_Institution :
The National Center for Computer Technology and Applied Math, King Abdulaziz City for Science and Technology, Riyadh, KSA
fYear :
2015
Firstpage :
127
Lastpage :
131
Abstract :
Preprocessing is one of the factors that affect the accuracy of text classification. It involves several steps, such as removing stop words, punctuation, and numerals. For Arabic text classification, stemming and root extraction have been proposed as additional preprocessing steps, with the resulting stems and roots used as classification features. In this study, we propose word segmentation as an additional preprocessing step. We used a dataset of 4,900 newspaper articles evenly distributed across seven classes and ran our experiments on segmented and non-segmented versions of it. We used chi-squared to select the top-ranked features, LTC as the representation scheme, and SVM as the classifier. By measuring accuracy, precision, recall, and F-measure, we evaluated the use of word orthography as a feature for Arabic text classification before and after segmentation. In all of our experiments, classification on the segmented dataset outperformed classification on the non-segmented dataset for the same number of features. Furthermore, the segmented dataset can match the classification performance of the non-segmented dataset using fewer features.
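As a rough illustration of the LTC representation named in the abstract, the sketch below computes weights under one common reading of the SMART notation: log-scaled term frequency (l), inverse document frequency (t), and cosine normalization (c). The exact variant used in the paper is not stated in this record, so the formula, the function name `ltc_weights`, and the toy English tokens are all assumptions made for illustration only.

```python
import math
from collections import Counter


def ltc_weights(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed formula: log(1 + tf) * log(N / df), then L2-normalized.
    This is a common LTC variant, not necessarily the paper's.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # l: log-scaled term frequency; t: inverse document frequency.
        raw = {
            term: math.log(1 + freq) * math.log(n_docs / df[term])
            for term, freq in tf.items()
        }
        # c: cosine (L2) normalization, guarding against an all-zero vector.
        norm = math.sqrt(sum(w * w for w in raw.values()))
        vectors.append(
            {term: (w / norm if norm else 0.0) for term, w in raw.items()}
        )
    return vectors


# Hypothetical toy corpus; the tokens stand in for segmented or
# non-segmented word forms and are not taken from the paper's dataset.
docs = [
    ["economy", "market", "market"],
    ["sports", "match", "market"],
    ["sports", "goal"],
]
vectors = ltc_weights(docs)
```

In a pipeline like the one the abstract describes, vectors of this kind would be built over the chi-squared-selected features and passed to an SVM; those two stages are omitted here to keep the sketch self-contained.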
Keywords :
Classification algorithms
Publisher :
IEEE
Conference_Title :
Asian Language Processing (IALP), 2015 International Conference on
Print_ISBN :
978-1-4673-9595-3
Type :
conf
DOI :
10.1109/IALP.2015.7451548
Filename :
7451548