DocumentCode :
1843044
Title :
Evaluating the impact of data quality on sampling
Author :
Van Hulse, Jason ; Khoshgoftaar, Taghi M. ; Napolitano, Amri
Author_Institution :
Dept. of Comput. & Electr. Eng. & Comput. Sci., Florida Atlantic Univ., Boca Raton, FL, USA
fYear :
2010
fDate :
4-6 Aug. 2010
Firstpage :
31
Lastpage :
36
Abstract :
Three important data characteristics that can substantially impact a data mining project are class imbalance, poor data quality and the size of the training dataset. Data sampling is a commonly used method for improving learner performance when data is imbalanced. However, little effort has been put forth to investigate the performance of data sampling techniques when data is both noisy and imbalanced. In this work, we present a comprehensive empirical investigation of how data sampling techniques react to changes in four training dataset characteristics: dataset size, class distribution, noise level and noise distribution. We present the performance of four common data sampling techniques using 11 learning algorithms. The results, which are based on an extensive suite of experiments for which over 15 million models were trained and evaluated, show that data sampling can be very effective at dealing with the combined problems of noise and imbalance. In addition, the dataset characteristics which have the greatest impact on each of the data sampling techniques are identified.
Keywords :
data mining; learning (artificial intelligence); sampling methods; class distribution; data quality; data sampling; dataset size; learning algorithm; noise distribution; noise level; training dataset; Data models; Neodymium; Noise; Noise level; Noise measurement; Training; Training data
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Information Reuse and Integration (IRI), 2010 IEEE International Conference on
Conference_Location :
Las Vegas, NV
Print_ISBN :
978-1-4244-8097-5
Type :
conf
DOI :
10.1109/IRI.2010.5558968
Filename :
5558968