• DocumentCode
    2916658
  • Title
    A scalable method for instance selection for class-imbalance datasets
  • Author
    De Haro-García, Aida; García-Pedrajas, Nicolás
  • Author_Institution
    Dept. of Comput. & Numerical Anal., Univ. of Cordoba, Cordoba, Spain
  • fYear
    2011
  • fDate
    22-24 Nov. 2011
  • Firstpage
    1383
  • Lastpage
    1390
  • Abstract
    Instance selection is becoming increasingly relevant because of the huge amount of data that is constantly being produced. Research areas such as bioinformatics, text mining and intrusion detection are generating huge amounts of information that must be dealt with. Instance selection is a powerful tool for reducing that information to manageable datasets. Most of the datasets in these areas share a common property: they are heavily class-imbalanced. The class of interest, also called the positive or minority class, is outnumbered many times by the negative, or majority, class. Thus, any instance selection algorithm addressing these problems must take into account two important features of such problems: first, the large size of the datasets, which makes scalability issues very relevant; second, the class-imbalanced distribution of the instances. In this paper, we propose a new methodology for instance selection that is specifically designed for large class-imbalanced datasets. We use a divide-and-conquer approach to deal with the scalability of the algorithms, and a combination of different rounds of instance selection to improve the results in terms of class-imbalance error measures. The validity of the proposed framework is assessed using 45 datasets. Our proposal improves the results of standard methods in accuracy and storage reduction, and at the same time is able to reduce the time needed by the algorithms, with a time complexity of O(n log(n)).
  • Keywords
    computational complexity; data mining; divide and conquer methods; storage management; very large databases; bioinformatics; class-imbalance datasets; class-imbalance error measures; class-imbalanced distribution; divide-and-conquer approach; heavily class-imbalanced; information reduction; instance selection algorithm; intrusion detection; manageable datasets; minority class; positive class; scalable method; storage reduction; text mining; time complexity; Accuracy; Algorithm design and analysis; Complexity theory; Partitioning algorithms; Proposals; Scalability; Training; Class-imbalanced problems; Data mining; Instance selection; Scaling up;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Intelligent Systems Design and Applications (ISDA), 2011 11th International Conference on
  • Conference_Location
    Cordoba
  • ISSN
    2164-7143
  • Print_ISBN
    978-1-4577-1676-8
  • Type
    conf
  • DOI
    10.1109/ISDA.2011.6121853
  • Filename
    6121853