Title :
Data mining from extreme data sets: very large and/or very skewed data sets
Author :
Hall, Lawrence O.
Author_Institution :
Dept. of Comput. Sci. & Eng., Univ. of South Florida, Tampa, FL, USA
Abstract :
The article describes an approach to the construction of classifiers from imbalanced data sets. A data set is imbalanced if the classification categories are not approximately equally represented. Real-world data sets are often composed predominantly of normal examples with only a small percentage of abnormal or interesting examples. The cost of misclassifying an abnormal (interesting) example as a normal example is also often much higher than the cost of the reverse error. Under-sampling of the majority (normal) class has been proposed as a good means of increasing the sensitivity of a classifier to the minority class. We discuss how a combination of over-sampling the minority (abnormal) class and under-sampling the majority (normal) class can achieve better classifier performance than only under-sampling the majority class. Our method of over-sampling the minority class involves creating synthetic minority class examples. Performance is measured using the area under the receiver operating characteristic curve. It is shown that a more diverse set of operating points can generally be found with the combination of over-sampling and under-sampling of an imbalanced data set. Usually, the best of the true positives with minimal false negatives is found when compared with loss ratios, different classification costs, etc. Details are provided.
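The abstract does not say how the synthetic minority examples are generated, so the following is only a minimal sketch under stated assumptions: each synthetic point is interpolated between a minority example and one of its nearest minority-class neighbours, the majority class is thinned by random sampling without replacement, and the area under the ROC curve is computed as the fraction of (positive, negative) score pairs ranked correctly. The function names, the parameter k, and the toy data are hypothetical and are not taken from the paper.

```python
import random


def oversample_minority(minority, n_new, k=3, rng=None):
    """Create n_new synthetic points, each on the line segment between a
    minority example and one of its k nearest minority-class neighbours.
    Assumes at least two distinct minority examples (illustrative only)."""
    if rng is None:
        rng = random.Random(0)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority examples by squared Euclidean distance.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(a + gap * (b - a) for a, b in zip(base, neighbour))
        )
    return synthetic


def undersample_majority(majority, n_keep, rng=None):
    """Keep a random subset of the majority class, without replacement."""
    if rng is None:
        rng = random.Random(0)
    return rng.sample(majority, n_keep)


def auc(scores_pos, scores_neg):
    """Area under the ROC curve: the fraction of (positive, negative)
    pairs in which the positive scores higher; ties count one half."""
    wins = sum(
        (p > n) + 0.5 * (p == n) for p in scores_pos for n in scores_neg
    )
    return wins / (len(scores_pos) * len(scores_neg))


# Toy data (made up): 4 minority examples, 100 majority examples.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
majority = [(float(i), float(i % 7)) for i in range(5, 105)]

balanced_minority = minority + oversample_minority(minority, n_new=8)
balanced_majority = undersample_majority(majority, n_keep=24)
```

Varying n_new and n_keep changes the class ratio seen by the learner, which is one way different operating points on the ROC curve might be obtained; the specific ratios here are arbitrary.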
Keywords :
data mining; pattern classification; classification; classifiers; extreme data sets; imbalanced data sets; majority class; minority class; over-sampling; receiver operating characteristic curve; sensitivity; under-sampling; very large data sets; very skewed data sets; Amino acids; Area measurement; Bagging; Boosting; Computer science; Costs; Machine learning; Proteins; Voting
Conference_Title :
Systems, Man, and Cybernetics, 2001 IEEE International Conference on
Conference_Location :
Tucson, AZ
Print_ISBN :
0-7803-7087-2
DOI :
10.1109/ICSMC.2001.972946