• DocumentCode
    666106
  • Title
    A novel evolutionary preprocessing method based on over-sampling and under-sampling for imbalanced datasets
  • Author
    Wong, Ginny Y.; Leung, Frank H. F.; Ling, Sai-Ho
  • Author_Institution
    Dept. of Electron. & Inf. Eng., Hong Kong Polytech. Univ., Hong Kong, China
  • fYear
    2013
  • fDate
    10-13 Nov. 2013
  • Firstpage
    2354
  • Lastpage
    2359
  • Abstract
    Imbalanced datasets are commonly encountered in real-world classification problems. However, many machine learning algorithms were originally designed for well-balanced datasets. Re-sampling has therefore become an important step in preprocessing imbalanced datasets. It aims to balance a dataset either by increasing the sample size of the smaller class or by decreasing the sample size of the larger class, known as over-sampling and under-sampling respectively. In this paper, a novel sampling strategy based on both over-sampling and under-sampling is proposed, in which the new samples of the smaller class are created by the Synthetic Minority Over-sampling Technique (SMOTE). The datasets are then refined by the evolutionary computation method CHC, which works on both the minority class and majority class samples. The result is a hybrid data preprocessing method that combines over-sampling and under-sampling techniques to re-sample datasets. The evaluation is done by applying the learning algorithm C4.5 to obtain a classification model from the re-sampled datasets. Experimental results show that the proposed approach can decrease the over-sampling rate by about 50% with only around a 3% discrepancy in accuracy.
  • Keywords
    evolutionary computation; learning (artificial intelligence); pattern classification; sampling methods; CHC; SMOTE; classification model; datasets balancing; evolutionary computational method; evolutionary preprocessing method; hybrid data preprocessing method; imbalanced datasets; machine learning algorithms; over-sampling techniques; real-world classification problems; resampling; synthetic minority over-sampling technique; under-sampling techniques; well-balanced datasets; Biological cells; Data preprocessing; Gold; Sociology; Statistics; Training; Vectors;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Industrial Electronics Society, IECON 2013 - 39th Annual Conference of the IEEE
  • Conference_Location
    Vienna
  • ISSN
    1553-572X
  • Type
    conf
  • DOI
    10.1109/IECON.2013.6699499
  • Filename
    6699499
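
The two re-sampling directions named in the abstract can be illustrated with a minimal sketch. It is not the authors' method: the over-sampling step follows the general SMOTE idea of interpolating between a minority sample and one of its nearest minority neighbours, while the under-sampling step uses plain random selection as a stand-in for the CHC evolutionary search, which the record does not describe in detail. The function names, the parameters `k`, `n_new`, and `n_keep`, and the toy data are all assumptions made for illustration.

```python
import math
import random


def smote_sample(minority, k=3, n_new=4, rng=None):
    """Create n_new synthetic minority points (SMOTE-style interpolation).

    Assumes minority holds at least two points, each a tuple of floats.
    """
    rng = rng or random.Random(0)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point, excluding itself
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


def random_under_sample(majority, n_keep, rng=None):
    """Keep n_keep majority points chosen at random.

    Stand-in only: the paper selects samples with CHC, not at random.
    """
    rng = rng or random.Random(0)
    return rng.sample(majority, n_keep)


# Toy data (hypothetical): 4 minority points, 10 majority points
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
majority = [(float(i), 5.0) for i in range(10)]

new_minority = minority + smote_sample(minority, k=2, n_new=4)
kept_majority = random_under_sample(majority, n_keep=8)
```

In this toy run the class sizes move from 4 versus 10 to 8 versus 8. A hybrid scheme of this shape needs fewer synthetic samples than over-sampling alone, which is the trade-off the abstract reports in terms of over-sampling rate and accuracy.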