Regional Information Center for Science and Technology - Effect of word segmentation on Arabic text classification

Abstract:

Preprocessing is one of the factors that affect the accuracy of text classification. It involves several steps, such as removing stop words, punctuation, and numerals. For Arabic text classification, stemming and root extraction have been proposed as additional preprocessing steps, with the resulting stems and roots used as features. In this study, we propose word segmentation as an additional preprocessing step. We used a dataset of 4,900 newspaper articles evenly distributed across seven classes and conducted our experiments on segmented and nonsegmented versions of it. We used chi-squared to select the top-ranked features, LTC as the representation scheme, and SVM as the classifier. By measuring accuracy, precision, recall, and F-measure, we evaluated the use of word orthography as a feature for Arabic text classification before and after segmentation. In all of the experiments we conducted, classification performance on the segmented dataset exceeded that on the nonsegmented dataset with the same number of features. Furthermore, the segmented dataset can attain the same classification performance as the nonsegmented dataset while using fewer features.
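The feature selection and weighting steps named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `chi_squared` and `ltc_vectors`, the toy documents, and the English placeholder tokens are all hypothetical. LTC is assumed here to mean the common SMART variant (logarithmic term frequency `1 + ln(tf)`, inverse document frequency `ln(N/df)`, cosine normalization); the paper may use a slightly different formula. The SVM step is omitted.

```python
import math
from collections import Counter


def chi_squared(docs, labels, term, cls):
    """Chi-squared score of `term` for class `cls` from a 2x2 document-count table."""
    a = b = c = d = 0
    for tokens, label in zip(docs, labels):
        has_term = term in tokens
        if label == cls:
            if has_term:
                a += 1  # in class, contains term
            else:
                c += 1  # in class, lacks term
        else:
            if has_term:
                b += 1  # other class, contains term
            else:
                d += 1  # other class, lacks term
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - c * b) ** 2 / denom if denom else 0.0


def ltc_vectors(docs, vocabulary):
    """Assumed LTC weighting: (1 + ln tf) * ln(N / df), then L2 normalization."""
    n_docs = len(docs)
    df = Counter(t for tokens in docs for t in set(tokens) if t in vocabulary)
    vectors = []
    for tokens in docs:
        tf = Counter(t for t in tokens if t in vocabulary)
        raw = {t: (1 + math.log(f)) * math.log(n_docs / df[t]) for t, f in tf.items()}
        norm = math.sqrt(sum(w * w for w in raw.values()))
        vectors.append({t: (w / norm if norm else 0.0) for t, w in raw.items()})
    return vectors


# Hypothetical toy corpus: each document is a list of tokens
# (segmented or nonsegmented, depending on the preprocessing applied).
docs = [
    ["goal", "match", "team"],
    ["team", "goal", "goal"],
    ["market", "bank", "team"],
    ["bank", "market", "market"],
]
labels = ["sport", "sport", "economy", "economy"]

# Rank terms by their best chi-squared score over all classes, keep the top k.
all_terms = sorted({t for tokens in docs for t in tokens})
scores = {t: max(chi_squared(docs, labels, t, c) for c in set(labels)) for t in all_terms}
top_k = 3
selected = set(sorted(all_terms, key=lambda t: -scores[t])[:top_k])

vectors = ltc_vectors(docs, selected)
print(sorted(selected))
print(vectors[0])
```

Running the same selection on a segmented and a nonsegmented tokenization of the same articles, with `top_k` held fixed, would mirror the comparison described in the abstract; the resulting vectors would then be passed to an SVM classifier.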