• DocumentCode
    244728
  • Title
    A new sampling approach for classification of imbalanced data sets with high density
  • Author
    Jia Pengfei ; Zhang Chunkai ; He Zhenyu
  • Author_Institution
    Shenzhen Grad. Sch., Harbin Inst. of Technol., Shenzhen, China
  • fYear
    2014
  • fDate
    15-17 Jan. 2014
  • Firstpage
    217
  • Lastpage
    222
  • Abstract
    Class imbalance in datasets is a common problem in machine learning. Because traditional classification algorithms are designed only for balanced cases, they generally perform poorly on imbalanced data classification problems, especially for imbalanced data with very high density. This paper first introduces the importance of imbalanced data classification in various fields, then discusses existing methods for solving the imbalanced data classification problem, and finally proposes two new sampling methods, based on Borderline-SMOTE, for imbalanced data with high density, especially big data with this kind of distribution. The two new algorithms not only over-sample the minority samples near the borderline, but also create appropriate synthetic samples on the majority class side and under-sample some particular majority class samples. Experiments show that the two algorithms can achieve better performance than random over-sampling, SMOTE (Synthetic Minority Over-sampling Technique), and Borderline-SMOTE under the AUC (Area Under the Receiver Operating Characteristic Curve) metric when the sampling rate brings the majority and minority classes to approximate balance.
  • Keywords
    learning (artificial intelligence); pattern classification; sampling methods; AUC metric evaluate method; Borderline-SMOTE; area under receiver operating characteristics curve; big data; distribution feature; imbalanced data classification issues; machine learning; majority class samples side; sampling approach; Breast; Classification algorithms; Distributed databases; Information management; Prediction algorithms; Sampling methods; Training; big data; classification; high density; imbalanced data; sampling method;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Big Data and Smart Computing (BIGCOMP), 2014 International Conference on
  • Conference_Location
    Bangkok
  • Type
    conf
  • DOI
    10.1109/BIGCOMP.2014.6741439
  • Filename
    6741439