Title :
Feature Selection with Maximum Information Metric in Text Categorization
Author :
Wang, Haijuan ; Han, Lixin ; Zeng, Xiaoqin ; Zhen, Zhilong
Author_Institution :
Dept. of Math., Tonghua Normal Univ., Tonghua, China
Abstract :
Text categorization usually suffers from a very large number of features, most of which are irrelevant or noisy and can mislead the classifier. Feature selection is therefore often performed to improve the efficiency and effectiveness of text categorization. In this paper, a novel feature selection approach for text categorization, called Maximum Information Metric (MIM), is proposed to obtain good-quality terms from documents. The method exploits term weight and document frequency to construct the correlation between a term and each class, and it aims to maximize the differences of a term across classes based on information theory. We design an improved evaluation function that yields a ranking of the features. Experimental results on the standard Reuters-21578 and 20-Newsgroups corpora show that the new feature selection approach outperforms classic methods, including Information Gain (IG) and the Chi-square statistic (CHI), in text categorization.
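For orientation, the kind of term ranking the abstract compares against can be sketched with the Chi-square statistic (CHI), one of the classic baselines it names. This is not the MIM evaluation function, whose formula is not given in this record. The function names `chi_square` and `rank_terms`, the use of document-level presence counts, and the choice to score each term by its maximum value over classes are assumptions made for illustration only.

```python
from collections import Counter


def chi_square(a, b, c, d):
    """Chi-square statistic for one term/class 2x2 contingency table.

    a: documents in the class that contain the term
    b: documents outside the class that contain the term
    c: documents in the class that lack the term
    d: documents outside the class that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        # Degenerate table (e.g. the term occurs in every document).
        return 0.0
    return n * (a * d - c * b) ** 2 / denom


def rank_terms(docs, labels):
    """Rank terms by their maximum chi-square score over all classes.

    docs is a list of token lists, labels the matching class labels.
    Taking the maximum over classes is an assumed aggregation choice.
    """
    n_docs = len(docs)
    class_sizes = Counter(labels)
    df = Counter()        # document frequency per term
    df_class = Counter()  # document frequency per (term, class)
    for tokens, label in zip(docs, labels):
        for term in set(tokens):
            df[term] += 1
            df_class[(term, label)] += 1

    scores = {}
    for term in df:
        best = 0.0
        for cls, size in class_sizes.items():
            a = df_class[(term, cls)]
            b = df[term] - a
            c = size - a
            d = n_docs - size - b
            best = max(best, chi_square(a, b, c, d))
        scores[term] = best
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
```

Calling `rank_terms(docs, labels)` on tokenized documents returns `(term, score)` pairs in descending order of score; keeping the top-k pairs gives a reduced feature set for a classifier.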
Keywords :
document handling; feature extraction; information retrieval; information theory; text analysis; 20-Newsgroups corpus; Chi-square statistic; Information Gain; classifier; document frequency; evaluation function; feature selection; huge-scale number; maximum information metric; standard Reuters-21578; term weight; text categorization; Computer science; Educational institutions; Information filtering; Information filters; Information retrieval; Information science; Information theory; Mathematics; Statistics; Text categorization
Conference_Title :
Information Science and Engineering (ICISE), 2009 1st International Conference on
Conference_Location :
Nanjing
Print_ISBN :
978-1-4244-4909-5
DOI :
10.1109/ICISE.2009.591