Text Classification for Imbalanced Data Sets

Author

Li, Yanling ; Zhu, Yehang ; Yang, Ping

Author_Institution

Xi'an Res. Inst. of Hi-Technol., Xi'an

Volume

2

fYear

2008

fDate

20-22 Dec. 2008

Firstpage

778

Lastpage

781

Abstract

Imbalanced data sets significantly degrade the classification performance attainable by most standard machine learning algorithms, yet in practice samples are often imbalanced. Reducing the effect of unevenly distributed training sets on text classification performance is therefore a major challenge for machine learning on imbalanced data sets. Current research on imbalanced data falls mainly into two areas: data-level and algorithm-level approaches. This paper studies three solutions: sample set restructuring, an enhancement method for feature selection, and weight retouch. Experimental results show that these methods are effective in improving classification performance. By comparing and analyzing the effects of these methods in the experiments, the paper draws useful conclusions on several key issues: which texts should be sampled, and how many, when restructuring the sample set; whether to define a separate threshold for each category in feature selection; and how to adjust the weights in the classification algorithm.

Keywords

learning (artificial intelligence); pattern classification; text analysis; enhancement method; feature selection; imbalanced data sets; machine learning; sample set restructuring; text classification performance; training sets; uneven distribution; weight retouch; imbalanced data set; re-sampling; text classification

fLanguage

English

Publisher

IEEE

Conference_Titel

Information Science and Engineering, 2008. ISISE '08. International Symposium on

Conference_Location

Shanghai

Print_ISBN

978-1-4244-2727-4

Type

conf

DOI

10.1109/ISISE.2008.89

Filename

4732504