• DocumentCode
    2118148
  • Title

    Text Classification for Imbalanced Data Sets

  • Author

    Li, Yanling ; Zhu, Yehang ; Yang, Ping

  • Author_Institution
    Xi'an Res. Inst. of Hi-Technol., Xi'an
  • Volume
    2
  • fYear
    2008
  • fDate
    20-22 Dec. 2008
  • Firstpage
    778
  • Lastpage
    781
  • Abstract
    Imbalanced data sets significantly degrade the classification performance attainable by most standard machine learning algorithms. However, real-world samples are often imbalanced. Therefore, reducing the effect of unevenly distributed training sets on text classification performance is a major challenge for machine learning on imbalanced data sets. Currently, research on imbalanced data falls mainly into two categories: data-level and algorithm-level approaches. This paper studies three solutions: sample set restructuring, an enhanced feature selection method, and weight retouch. Experimental results show that these methods are effective in improving classification performance. After comparing and analyzing the effects of these methods experimentally, the paper draws some useful conclusions on key issues, such as which texts to sample and how many to sample when restructuring the sample set, whether to define a separate threshold for each category in feature selection, and how to adjust the weights in the classification algorithm.
  • Keywords
    learning (artificial intelligence); pattern classification; text analysis; enhancement method; feature selection; imbalanced data sets; machine learning; sample set restructuring; text classification performance; training sets; uneven distribution; weight retouch; feature selection; imbalanced data set; re-sampling; text classification; weight retouch;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Information Science and Engineering, 2008. ISISE '08. International Symposium on
  • Conference_Location
    Shanghai
  • Print_ISBN
    978-1-4244-2727-4
  • Type

    conf

  • DOI
    10.1109/ISISE.2008.89
  • Filename
    4732504