DocumentCode :
3243090
Title :
A Text Feature Selection Algorithm Based on Improved TFIDF
Author :
Yang, Chengcheng ; He, Xingshi
Author_Institution :
Xi'an Polytech. Univ., Xi'an
fYear :
2008
fDate :
22-24 Oct. 2008
Firstpage :
1
Lastpage :
4
Abstract :
In Chinese text categorization systems, most classifiers use the vector space model (VSM), in which all attributes of the documents form a high-dimensional feature space. This high dimensionality is the bottleneck of categorization. TFIDF is a common method for weighting the terms in a document. The method is simple, but it does not account for the unbalanced distribution of terms among classes. This paper analyzes the TFIDF feature selection algorithm in depth and proposes a new TFIDF feature selection method based on Gini index theory. Experimental results show that the method improves the accuracy of text categorization.
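The abstract does not state the paper's formula, so the following is only a minimal sketch of the general idea: a standard TF-IDF weight multiplied by a class-distribution factor. It assumes one common Gini-index formulation for text features, the sum over classes of P(class | term) squared, estimated from document frequencies. The function name `tfidf_gini_weights` and its inputs are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter, defaultdict

def tfidf_gini_weights(docs, labels):
    """docs: list of token lists; labels: one class label per document.
    Returns one {term: weight} dict per document.
    Assumption: weight = tf * idf * gini, not the paper's exact formula."""
    n_docs = len(docs)

    # Document frequency of each term, overall and per class.
    df = Counter()
    df_by_class = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        for term in set(tokens):
            df[term] += 1
            df_by_class[term][label] += 1

    # Assumed Gini factor: sum over classes of P(class | term)^2.
    # It is 1.0 when a term occurs in a single class and shrinks
    # as the term spreads evenly across classes.
    gini = {
        term: sum((count / df[term]) ** 2 for count in per_class.values())
        for term, per_class in df_by_class.items()
    }

    weights = []
    for tokens in docs:
        tf = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term]) * gini[term]
            for term, count in tf.items()
        })
    return weights

# Toy corpus (made up for illustration): "ball" occurs only in one class,
# "game" is split evenly across both classes.
docs = [["ball", "game"], ["ball", "team"], ["vote", "game"], ["vote", "law"]]
labels = ["sport", "sport", "politics", "politics"]
weights = tfidf_gini_weights(docs, labels)
```

In this toy corpus "ball" and "game" have the same term frequency and document frequency in the first document, so plain TF-IDF would weight them equally; the assumed Gini factor halves the weight of "game" because it is spread evenly over both classes.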
Keywords :
natural language processing; text analysis; Chinese text categorization system; Gini index theory; TFIDF feature selection method; text feature selection algorithm; vector space model; Algorithm design and analysis; Electronic mail; Entropy; Frequency; Helium; Mutual information; Text categorization;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Pattern Recognition, 2008. CCPR '08. Chinese Conference on
Conference_Location :
Beijing
Print_ISBN :
978-1-4244-2316-3
Type :
conf
DOI :
10.1109/CCPR.2008.87
Filename :
4663040