• DocumentCode
    3426383
  • Title

    Diversity analysis on imbalanced data sets by using ensemble models

  • Author

    Wang, Shuo ; Yao, Xin

  • Author_Institution
    Sch. of Comput. Sci., Univ. of Birmingham, Birmingham
  • fYear
    2009
  • fDate
    March 30 2009-April 2 2009
  • Firstpage
    324
  • Lastpage
    331
  • Abstract
    Many real-world applications, such as medical diagnosis, fraud detection, and text classification, involve learning from imbalanced data sets. When the minority class has very few instances, they cannot provide sufficient information, and classification performance degrades greatly. Because ensembles are a good way to improve the classification performance of a weak learner, several ensemble-based algorithms have been proposed to solve the class imbalance problem. However, although diversity is one influential factor in ensembles, it is still not clear how diversity affects classification performance, especially on minority classes. This paper explores the impact of diversity on each class and on overall performance. Accuracy, the other influential factor, is also discussed because of the trade-off between diversity and accuracy. Firstly, three popular re-sampling methods are combined into our ensemble model and evaluated for diversity analysis: under-sampling, over-sampling, and SMOTE, a data generation algorithm. Secondly, we experiment not only on two-class tasks but also on those with multiple classes. Thirdly, we improve SMOTE in a novel way for multi-class data sets in an ensemble model, SMOTEBagging.
  • Keywords
    data handling; sampling methods; SMOTE; SMOTEBagging; class imbalance problem; data generation algorithm; diversity analysis; ensemble models; ensemble-based algorithms; fraud detection; imbalanced data sets; medical diagnosis; multi-class data sets; over-sampling; resampling methods; text classification; two-class tasks; under-sampling; weak learner; Bagging; Boosting; Costs; Data analysis; Intrusion detection; Medical diagnosis; Predictive models; Semisupervised learning; Text categorization; Voting;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Computational Intelligence and Data Mining, 2009. CIDM '09. IEEE Symposium on
  • Conference_Location
    Nashville, TN
  • Print_ISBN
    978-1-4244-2765-9
  • Type

    conf

  • DOI
    10.1109/CIDM.2009.4938667
  • Filename
    4938667
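
The abstract names three re-sampling methods used to build ensemble members: under-sampling, over-sampling, and SMOTE. The sketch below illustrates the general idea of each in plain Python. It is not the authors' implementation and does not reproduce SMOTEBagging; the function names `resample_balanced` and `smote_like`, the tuple-based data layout, and the parameters `mode`, `n_new`, and `k` are all assumptions made for illustration.

```python
import random


def resample_balanced(majority, minority, mode, rng):
    """Draw one class-balanced bag by sampling with replacement.

    Hypothetical helper: 'under' shrinks both classes to the minority size,
    'over' grows both classes to the majority size.
    """
    if mode == "under":
        n = len(minority)
    elif mode == "over":
        n = len(majority)
    else:
        raise ValueError("mode must be 'under' or 'over'")
    maj_bag = [rng.choice(majority) for _ in range(n)]
    min_bag = [rng.choice(minority) for _ in range(n)]
    return maj_bag, min_bag


def smote_like(minority, n_new, k, rng):
    """Create synthetic minority points in the spirit of SMOTE.

    Each new point lies on the segment between a randomly chosen minority
    point and one of its k nearest minority neighbours (squared Euclidean
    distance). Requires at least two minority points.
    """
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        x = minority[i]
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, neighbour)))
    return synthetic


if __name__ == "__main__":
    rng = random.Random(0)
    majority = [(float(i), float(i % 3)) for i in range(20)]
    minority = [(0.0, 0.0), (1.0, 1.0), (2.0, 0.5)]
    maj_bag, min_bag = resample_balanced(majority, minority, "under", rng)
    print(len(maj_bag), len(min_bag))
    print(smote_like(minority, 2, 2, rng))
```

In a bagging-style ensemble, each base classifier would be trained on a different bag produced this way, so the random draws are one source of the diversity the abstract discusses.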