Title :
Research on the feature selection techniques used in text classification
Author :
Li, Yan ; Chen, Chungang
Author_Institution :
Sch. of Comput. Sci. & Eng., Xi'an Univ. of Technol., Xi'an, China
Abstract :
With the ever-increasing number of digital documents, the ability to classify them automatically, both quickly and accurately, is becoming more critical and more difficult. This paper develops a text classification system for Chinese documents and proposes an HTF-WDF algorithm for feature selection. Unlike other feature selection algorithms, this method takes term frequency into account. Using the idea of fuzzy features, terms with high term frequency (HTF) are identified and appended to the feature list. Features that represent the topic of the documents are selected according to their weighted document frequencies (WDF), which avoids the problems of the traditional document frequency (DF) method. A Support Vector Machine (SVM) is then used to train the classifier. The proposed algorithm is verified on representative Chinese documents. The experimental results show that the proposed algorithm outperforms the traditional DF algorithm.
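The abstract does not give the exact HTF and WDF formulas, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes a hypothetical WDF in which each document contributes a term's relative frequency (count divided by document length) rather than a flat 1, and it uses a crisp per-document count threshold (`min_count`, an illustrative parameter) as a stand-in for the paper's fuzzy HTF criterion. All function names and the toy English tokens are assumptions; the paper works on segmented Chinese text.

```python
from collections import Counter


def document_frequency(docs):
    """Traditional DF: number of documents containing each term."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    return df


def weighted_document_frequency(docs):
    """Hypothetical WDF (assumption): each document contributes the term's
    relative frequency instead of counting as 1."""
    wdf = Counter()
    for tokens in docs:
        if not tokens:
            continue
        for term, n in Counter(tokens).items():
            wdf[term] += n / len(tokens)
    return wdf


def high_term_frequency_terms(docs, min_count):
    """Hypothetical HTF step (assumption): terms that occur at least
    min_count times in some single document. A crisp threshold is used
    here in place of the paper's fuzzy criterion."""
    htf = set()
    for tokens in docs:
        for term, n in Counter(tokens).items():
            if n >= min_count:
                htf.add(term)
    return htf


def select_features(docs, k, min_count):
    """Take the k terms with the highest WDF, then append any HTF terms
    that were not already selected."""
    wdf = weighted_document_frequency(docs)
    # Sort by descending score, breaking ties alphabetically for determinism.
    features = sorted(wdf, key=lambda t: (-wdf[t], t))[:k]
    for term in sorted(high_term_frequency_terms(docs, min_count)):
        if term not in features:
            features.append(term)
    return features


# Toy tokenized corpus (illustrative only).
docs = [
    ["stock", "market", "stock", "stock", "price"],
    ["market", "price", "the"],
    ["football", "match", "the"],
    ["football", "football", "football", "goal", "the"],
]

df = document_frequency(docs)
selected = select_features(docs, k=2, min_count=3)
```

In this toy corpus plain DF ranks the uninformative term "the" highest because it appears in three documents, whereas the frequency-weighted score ranks "football" above it, and the HTF step adds "stock" even though it occurs in only one document. The resulting feature list would then define the vector space on which an SVM classifier is trained.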
Keywords :
fuzzy set theory; natural language processing; pattern classification; support vector machines; text analysis; digital document; document classification; feature selection technique; fuzzy feature; high term frequency; representative Chinese document verification; support vector machine; text classification system; weighted document frequency; Accuracy; Algorithm design and analysis; Classification algorithms; Support vector machine classification; Testing; Text categorization; Training; feature selection; machine learning; text classification
Conference_Title :
Fuzzy Systems and Knowledge Discovery (FSKD), 2012 9th International Conference on
Conference_Location :
Sichuan
Print_ISBN :
978-1-4673-0025-4
DOI :
10.1109/FSKD.2012.6234223