Title :
Ranking and selecting terms for text categorization via SVM discriminate boundary
Author :
Kuo, Tien-Fang ; Yajima, Yasutoshi
Author_Institution :
Dept. of Ind. Eng. & Manage., Tokyo Inst. of Technol., Japan
Abstract :
The problem of natural language document categorization consists in classifying documents into predetermined categories based on their contents. Each distinct term, or word, in the documents is a feature for representing a document. In general, the number of terms may be extremely large, and many of them may be redundant, which can degrade classification performance. In this paper, an SVM-based feature ranking and selection method for text categorization is proposed. The contribution of each term to classification is calculated from the nonlinear discriminate boundary generated by a support vector machine (SVM). Experimental results on the Reuters-21578 dataset show that the proposed method achieves higher classification performance than existing feature selection methods based on LSI and χ² statistics.
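For orientation, the χ² baseline that the abstract compares against can be sketched as follows. This is a minimal illustration of the standard χ² term-category statistic, not the SVM-based method proposed in the paper. The function names `chi2_term_scores` and `top_terms` and the toy documents are hypothetical, chosen only for this example.

```python
from collections import Counter


def chi2_term_scores(docs, labels, category):
    """Score each term by its chi-square statistic with respect to `category`.

    docs   : list of token lists, one per document (hypothetical input format)
    labels : list of category labels, aligned with docs
    """
    n = len(docs)
    n_pos = sum(1 for y in labels if y == category)
    n_neg = n - n_pos

    df_pos = Counter()  # documents in the category that contain the term
    df_neg = Counter()  # documents outside the category that contain the term
    for tokens, y in zip(docs, labels):
        target = df_pos if y == category else df_neg
        target.update(set(tokens))  # count each term once per document

    scores = {}
    for term in set(df_pos) | set(df_neg):
        a = df_pos[term]   # in category, term present
        b = df_neg[term]   # not in category, term present
        c = n_pos - a      # in category, term absent
        d = n_neg - b      # not in category, term absent
        denom = (a + c) * (b + d) * (a + b) * (c + d)
        scores[term] = n * (a * d - c * b) ** 2 / denom if denom else 0.0
    return scores


def top_terms(scores, k):
    """Return the k highest-scoring terms, breaking ties alphabetically."""
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    docs = [["wheat", "price"], ["wheat", "export"],
            ["oil", "price"], ["oil", "export"]]
    labels = ["grain", "grain", "crude", "crude"]
    scores = chi2_term_scores(docs, labels, "grain")
    print(top_terms(scores, 2))
```

Terms that occur equally often inside and outside the category score zero and are the first candidates for removal; the proposed method replaces this statistic with a ranking derived from the SVM boundary.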
Keywords :
classification; natural languages; support vector machines; text analysis; SVM based feature ranking; SVM based feature selection; SVM discriminate boundary; document classification; natural language document categorization; nonlinear discriminate boundary; support vector machine; term ranking; term selection; text categorization; Content management; Engineering management; Industrial engineering; Large scale integration; Natural languages; Statistics; Support vector machine classification; Support vector machines; Technology management; Text categorization;
Conference_Title :
Granular Computing, 2005 IEEE International Conference on
Print_ISBN :
0-7803-9017-2
DOI :
10.1109/GRC.2005.1547341