Author_Institution :
Coll. of Comput. Sci., Beijing Language & Culture Univ., Beijing, China
Abstract :
Feature selection plays an important role in text classification. Unbalanced text classification is a special kind of classification problem that arises widely in practice, yet it is not known which feature selection method is most effective for it: to our knowledge, there has been no systematic study of these feature selection methods on unbalanced text classification. This paper presents a comparative study of feature selection methods for this problem, with a focus on aggressive dimensionality reduction. We ran our experiments on both a Chinese and an English corpus. Seven methods were evaluated: term selection based on document frequency (DF), information gain (IG), the chi-square statistic (CHI), mutual information (MI), expected cross entropy (ECE), the weight of evidence for text (WET), and the odds ratio (ODD). We found ODD and WET most effective in the two-class classification task; in contrast, IG and CHI performed relatively poorly because of their bias towards rare terms and their sensitivity to probability-estimation errors. In the multi-class task, however, IG and CHI performed better, while MI performed poorly.
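For concreteness, the following is a minimal Python sketch of how two of the criteria compared above, the chi-square statistic (CHI) and the odds ratio (ODD), can be scored for a single term and class from document counts. The contingency-count variable names and the smoothing constant are illustrative assumptions, not taken from the paper; the formulas are the standard definitions of these two criteria.

```python
import math

def chi2(a, b, c, d):
    """Chi-square score for one term and one class.
    a: docs in the class containing the term
    b: docs outside the class containing the term
    c: docs in the class without the term
    d: docs outside the class without the term"""
    n = a + b + c + d
    den = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - b * c) ** 2 / den if den else 0.0

def odds_ratio(a, b, c, d, eps=0.5):
    """Log odds ratio with add-eps smoothing to avoid zero divisions."""
    p = (a + eps) / (a + c + 2 * eps)   # P(term | class)
    q = (b + eps) / (b + d + 2 * eps)   # P(term | not class)
    return math.log(p * (1 - q) / (q * (1 - p)))

# Example: a term in 40 of 50 positive docs but only 10 of 950 negative
# docs scores highly under both criteria, so it would be retained during
# aggressive dimensionality reduction.
print(chi2(40, 10, 10, 940))        # ~623.3
print(odds_ratio(40, 10, 10, 940))  # ~5.85
```

In practice, each term is scored against each class this way, and only the top-scoring terms are kept as features.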
Keywords :
entropy; feature extraction; natural language processing; pattern classification; probability; text analysis; Chinese corpus; English corpus; aggressive dimensionality reduction; classification problem; document frequency; expected cross entropy; feature selection method; information gain; multiclass task; mutual information; odds ratio; probability estimation error; rare terms; two-class classification task; weight of evidence for text; DF; artificial intelligence; feature selection; spam filtering; unbalanced text classification;