Abstract:
A major difficulty in text categorization is the high dimensionality of the feature space, so feature selection is an important step for reducing it. Automatic feature selection methods such as document frequency thresholding (DF), information gain (IG), and mutual information (MI) are commonly applied in text categorization, but they do not use term frequency information. In this paper, we propose improved DF, IG, and MI methods that use term frequency information. Experiments show that the improved methods perform notably better than the original DF, IG, and MI methods.
Keywords:
statistical analysis; text analysis; feature selection; improved document frequency thresholding; improved information gain; improved mutual information; term frequency information; text categorization; classification algorithms; frequency conversion; machine learning; mutual information; time-frequency analysis
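To illustrate the distinction the abstract draws, the following minimal Python sketch contrasts plain document frequency, which only records whether a term occurs in a document, with a count that also reflects how often it occurs. It is not the improved DF method of this paper, whose definition is not given in the abstract; the function names, the toy corpus, and the thresholds are all assumptions made for illustration.

```python
from collections import Counter


def document_frequency(docs):
    """Count, for each term, the number of documents that contain it at least once."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # set() discards repeated occurrences within a document
    return df


def collection_term_frequency(docs):
    """Count, for each term, its total number of occurrences across all documents.

    Hypothetical term-frequency-aware score, used here only for contrast with DF.
    """
    tf = Counter()
    for tokens in docs:
        tf.update(tokens)  # every occurrence is counted
    return tf


def select_by_threshold(scores, threshold):
    """Keep the terms whose score is at least the threshold."""
    return {term for term, score in scores.items() if score >= threshold}


# Toy corpus (assumed): each document is a list of already-tokenized terms.
docs = [
    ["price", "stock", "stock", "stock", "market"],
    ["stock", "market", "news"],
    ["news", "weather", "news"],
]

df = document_frequency(docs)
tf = collection_term_frequency(docs)

print("DF:", dict(df))
print("TF:", dict(tf))
print("Selected by DF >= 2:", sorted(select_by_threshold(df, 2)))
print("Selected by TF >= 3:", sorted(select_by_threshold(tf, 3)))
```

In this toy corpus "stock" and "market" have the same document frequency, so DF thresholding cannot tell them apart, while the occurrence counts differ. This is the kind of information that frequency-blind selection discards.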