Title :
Data Imbalance Problem in Text Classification
Author :
Li, Yanling ; Sun, Guoshe ; Zhu, Yehang
Author_Institution :
Xi'an Res. Inst. of Hi-Technol., Xi'an, China
Abstract :
Addressing the ever-present problem of imbalanced data in text classification, the authors study several forms of imbalance, such as text number, class size, subclass, and class fold. A series of related experiments yields several useful conclusions. First, when two classes contain almost the same number of texts, the difference in word count becomes the major factor affecting classification accuracy. Second, increasing the size of the small class improves classification accuracy only to a limited extent. Third, with unbalanced data, words that appear in both classes often carry strong class information; that is, class overlap does not affect classification accuracy.
Keywords :
pattern classification; text analysis; data imbalance problem; text classification; word number difference; Accuracy; Classification algorithms; Feature extraction; Machine learning; Text categorization; Training; class fold; class size; data distribution; imbalanced data;
Conference_Title :
Information Processing (ISIP), 2010 Third International Symposium on
Conference_Location :
Qingdao
Print_ISBN :
978-1-4244-8627-4
DOI :
10.1109/ISIP.2010.47