DocumentCode
3227919
Title
Fine Text Categorization: Using Very Aggressive Feature Selection to Cope with Mass Duplicated Features
Author
DAI, Liuling ; HU, Jinwu ; Wu, ShiKun
Author_Institution
Sch. of Comput. Sci., Beijing Inst. of Technol., Beijing
Volume
2
fYear
2008
fDate
20-22 Oct. 2008
Firstpage
984
Lastpage
988
Abstract
Text categorization is a key issue in text mining. Although this problem has been studied extensively, most work focuses on the classification of rough categories, in which clearly distinct features differentiate one category from the others. Very few studies address the fine text categorization (FTC) problem, which is characterized by many features duplicated across different categories. In this paper, we first point out that traditional feature selection levels can't be used directly to cope with this problem. To improve performance, we perform very aggressive feature selection (VAFS) by first removing the common features arbitrarily and then selecting features with a modified CHI-square statistic in a very aggressive manner. In the end, only very few features are used to learn the underlying concepts of the categories. Experimental results show that VAFS improves performance notably and that rule-based algorithms are more suitable than vector-based algorithms.
Keywords
data mining; knowledge based systems; statistical analysis; text analysis; aggressive feature selection; fine text categorization; mass duplicated features; rough categories; rule based algorithms; text mining; Automation; Information retrieval; Information technology; Laboratories; Machine learning algorithms; Partial response channels; Support vector machine classification; Support vector machines; Text categorization; Text mining; SVM; feature selection; fine text categorization; kNN; rough set;
fLanguage
English
Publisher
IEEE
Conference_Titel
Intelligent Computation Technology and Automation (ICICTA), 2008 International Conference on
Conference_Location
Hunan
Print_ISBN
978-0-7695-3357-5
Type
conf
DOI
10.1109/ICICTA.2008.90
Filename
4659910