DocumentCode
2118148
Title
Text Classification for Imbalanced Data Sets
Author
Li, Yanling ; Zhu, Yehang ; Yang, Ping
Author_Institution
Xi'an Res. Inst. of Hi-Technol., Xi'an
Volume
2
fYear
2008
fDate
20-22 Dec. 2008
Firstpage
778
Lastpage
781
Abstract
Imbalanced data sets significantly degrade the classification performance attainable by most standard machine learning algorithms, yet in practice samples are often imbalanced. Reducing the effect of unevenly distributed training sets on text classification performance is therefore a major challenge for machine learning on imbalanced data sets. Current research on imbalanced data falls mainly into two areas: data-level and algorithm-level approaches. This paper studies three solutions: sample set restructuring, an enhanced feature selection method, and weight retouch. Experimental results show that these methods are effective in improving classification performance. By comparing and analyzing the effects of these methods in the experiments, the paper draws explicit conclusions on several key issues: which texts should be sampled and how many should be sampled for sample set restructuring, whether to define a separate threshold for each category in feature selection, and how to adjust the weights in the classification algorithm.
Keywords
learning (artificial intelligence); pattern classification; text analysis; enhancement method; feature selection; imbalanced data sets; machine learning; sample set restructuring; text classification performance; training sets; uneven distribution; weight retouch; imbalanced data set; re-sampling; text classification
fLanguage
English
Publisher
IEEE
Conference_Titel
Information Science and Engineering, 2008. ISISE '08. International Symposium on
Conference_Location
Shanghai
Print_ISBN
978-1-4244-2727-4
Type
conf
DOI
10.1109/ISISE.2008.89
Filename
4732504