Applying machine learning algorithms for automatic Persian text classification

Author

Farhoodi, Mojgan ; Yari, Alireza

Author_Institution

Iran Telecommun. Res. Center, Iran

fYear

2010

fDate

Nov. 30 2010-Dec. 2 2010

Firstpage

318

Lastpage

323

Abstract

Because of its many applications in data mining and information technology, automatic document classification is an important topic in computer science. Classification plays a vital role in many information management and retrieval tasks. Document classification, also known as document categorization, is the process of assigning one or more predefined category labels to a document. It is often posed as a supervised learning problem in which a set of labeled data is used to train a classifier that can then label future examples. Document classification consists of several stages, including text processing, feature extraction, feature vector construction, and final classification, so an improvement in any stage should lead to better overall results. In this paper, we apply machine learning methods to automatic Persian news classification. We first apply language-specific preprocessing to the Hamshahri dataset, and then extract a feature vector for each news text using feature weighting and feature selection algorithms. We then train classifiers using the support vector machine (SVM) and K-nearest neighbor (KNN) algorithms. In our experiments, both algorithms give acceptable results for Persian text classification, but KNN performs better than SVM.
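The pipeline outlined in the abstract (preprocessing, feature weighting, feature selection, classification) can be illustrated with a minimal sketch. The abstract does not say which weighting scheme, selection criterion, or distance measure the authors used, so everything here is an assumption: TF-IDF weighting, a document-frequency threshold (min_df) for feature selection, cosine similarity for KNN, and whitespace tokenization standing in for Persian-specific preprocessing. The function names and the toy English documents are hypothetical and are not taken from the Hamshahri dataset.

```python
import math
from collections import Counter


def tokenize(text):
    # Placeholder for language preprocessing; real Persian text would
    # need normalization, stop-word removal, and possibly stemming.
    return text.lower().split()


def fit_idf(docs, min_df=1):
    # Feature selection (assumed): keep terms appearing in >= min_df documents.
    # Feature weighting (assumed): smoothed inverse document frequency.
    n = len(docs)
    df = Counter(term for doc in docs for term in set(tokenize(doc)))
    return {t: math.log(n / c) + 1.0 for t, c in df.items() if c >= min_df}


def vectorize(text, idf):
    # Build an L2-normalized sparse TF-IDF vector over the selected terms.
    tf = Counter(t for t in tokenize(text) if t in idf)
    vec = {t: f * idf[t] for t, f in tf.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}


def cosine(a, b):
    # Both vectors are unit length, so the dot product is the cosine.
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def knn_predict(text, train_vecs, train_labels, idf, k=3):
    # Majority vote among the k most similar training documents.
    query = vectorize(text, idf)
    ranked = sorted(zip(train_vecs, train_labels),
                    key=lambda pair: cosine(query, pair[0]),
                    reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy corpus with two news categories.
train_docs = [
    "team wins football match",
    "player scores goal in match",
    "coach praises team after game",
    "bank raises interest rates",
    "market prices rise as bank reports",
    "government budget and market rates",
]
train_labels = ["sport", "sport", "sport", "economy", "economy", "economy"]

idf = fit_idf(train_docs)
train_vecs = [vectorize(doc, idf) for doc in train_docs]

sport_guess = knn_predict("football team scores goal", train_vecs, train_labels, idf)
economy_guess = knn_predict("bank market rates", train_vecs, train_labels, idf)
```

With min_df=1 the selection step keeps every term; raising it drops rare terms. An SVM could be trained on the same vectors, but it is left out here to keep the sketch dependency-free.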

Keywords

data mining; feature extraction; information retrieval; learning (artificial intelligence); natural language processing; pattern classification; support vector machines; text analysis; Hamshahri dataset; K-nearest neighbor algorithm; automatic Persian text classification; automatic document classification; document categorization; document classification; feature selection algorithm; feature vector construction; feature weighting; information management; information technology; machine learning algorithm; supervised learning problem; support vector machine; text processing; Classification algorithms; Feature extraction; Kernel; Machine learning algorithms; Support vector machine classification; Text categorization; Hamshahri; KNN; SVM; feature selection; machine learning; text classification

fLanguage

English

Publisher

IEEE

Conference_Titel

Advanced Information Management and Service (IMS), 2010 6th International Conference on

Conference_Location

Seoul

Print_ISBN

978-1-4244-8599-4

Electronic_ISBN

978-89-88678-32-9

Type

conf

Filename

5713467