DocumentCode
654087
Title
Selecting samples for labeling in unbalanced streaming data environments
Author
Hanqing Hu ; Kantardzic, Mehmed M. ; Sethi, Tegjyot Singh
Author_Institution
CECS Dept., Univ. of Louisville, Louisville, KY, USA
fYear
2013
fDate
Oct. 30 2013-Nov. 1 2013
Firstpage
1
Lastpage
7
Abstract
In this paper we propose an alternative to random selection for labeling extremely unbalanced stream data sets, in which one class makes up only 1-10% of the entire data set. Labeling, especially when it requires human effort, is often time consuming and expensive. In an extremely unbalanced data set, many data points usually have to be labeled to obtain enough minority class samples. The goal of this research was to reduce the total number of samples that must be labeled when training new classification models to update a streaming data ensemble classifier. Our approach finds minority class clusters using the grid density algorithm and samples minority class instances inside those regions. Results on a synthetic data set showed that the efficiency of the proposed approach varies with grid size. Results on real-world data sets confirmed that observation and showed that, when the data set has high dimensionality, dimensionality reduction is useful: it reduces the number of grids in the data space and so increases sampling efficiency. Our best results showed a 19.4% improvement for an eight-dimension data set without dimensionality reduction, and a 27.4% improvement for a thirty-six-dimension data set with dimensionality reduction.
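The abstract gives no implementation details, so the following is only a minimal sketch of the general idea it describes: partition the data space into a grid, find cells that are dense in known minority class samples, and restrict labeling to incoming samples that fall in those cells. The function names, the `cell_size` and `min_count` parameters, and the toy data are all assumptions made for illustration; they are not taken from the paper.

```python
from collections import Counter
from math import floor


def cell_of(point, cell_size):
    """Map a point to the integer coordinates of the grid cell containing it."""
    return tuple(floor(x / cell_size) for x in point)


def dense_minority_cells(minority_points, cell_size, min_count):
    """Return the cells holding at least min_count known minority samples.

    min_count is a hypothetical density threshold, not a value from the paper.
    """
    counts = Counter(cell_of(p, cell_size) for p in minority_points)
    return {cell for cell, n in counts.items() if n >= min_count}


def select_for_labeling(unlabeled_points, dense_cells, cell_size):
    """Keep only the unlabeled points that fall inside a dense minority cell."""
    return [p for p in unlabeled_points if cell_of(p, cell_size) in dense_cells]


# Toy data (illustrative only): already-labeled minority samples and a new batch.
known_minority = [(0.1, 0.2), (0.3, 0.4), (0.2, 0.1), (5.5, 5.5)]
incoming = [(0.5, 0.5), (3.0, 3.0), (0.9, 0.1), (5.6, 5.4)]

cells = dense_minority_cells(known_minority, cell_size=1.0, min_count=2)
candidates = select_for_labeling(incoming, cells, cell_size=1.0)
print(cells)
print(candidates)
```

In this sketch `cell_size` plays the role of the grid size that the abstract reports as affecting efficiency: a larger cell admits more candidates per dense region, while a smaller cell needs more known minority samples before any cell counts as dense.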
Keywords
data analysis; pattern classification; pattern clustering; random processes; sampling methods; classification model training; data point labelling; data set dimensionality reduction; data space; eight-dimension data set improvement; extremely-unbalanced stream data set labeling; grid density algorithm; grid sizes; human resources; minority class clusters; minority-class samples; random sample selection; sampling efficiency improvement; streaming data ensemble classifier update; synthetic data set; thirty-six-dimension data set improvement; unbalanced streaming data environment; Algorithm design and analysis; Classification algorithms; Clustering algorithms; Data mining; Data models; Labeling; Training; Classification; Grid Density; Labeling; Stream Data;
fLanguage
English
Publisher
IEEE
Conference_Titel
Information, Communication and Automation Technologies (ICAT), 2013 XXIV International Symposium on
Conference_Location
Sarajevo
Type
conf
DOI
10.1109/ICAT.2013.6684046
Filename
6684046