DocumentCode :
659458
Title :
Distributed confidence-weighted classification on MapReduce
Author :
Djuric, Nemanja ; Grbovic, Mihajlo ; Vucetic, Slobodan
Author_Institution :
Temple Univ., Philadelphia, PA, USA
fYear :
2013
fDate :
6-9 Oct. 2013
Firstpage :
458
Lastpage :
466
Abstract :
In recent years, explosive growth in data size, data complexity, and data rates, triggered by the emergence of high-throughput technologies such as remote sensing, crowd-sourcing, social networks, and computational advertising, has led to the increasing availability of data sets of unprecedented scale, with billions of high-dimensional examples stored on hundreds of terabytes of memory. To make use of this large-scale data and extract useful knowledge, researchers in the machine learning and data mining communities face numerous challenges, since classification algorithms designed for standard desktop computers cannot address these problems due to memory and time constraints. As a result, there is an evident need for novel, more scalable algorithms that can handle large data sets. In this paper we propose such a method, named AROW-MR, a linear SVM solver for efficient training of the recently proposed confidence-weighted (CW) classifiers. Linear CW models maintain a Gaussian distribution over parameter vectors, allowing a user to estimate parameter confidence in addition to the separating hyperplane between two classes. The proposed method employs the MapReduce framework to train a CW classifier in a distributed way, obtaining significant improvements in both training time and accuracy. This is achieved by training local CW classifiers on each mapper and then optimally combining the local classifiers on the reducer to obtain an aggregated, more accurate linear CW model. We validated the proposed algorithm on synthetic data, and further showed that the AROW-MR algorithm outperforms the baseline classifiers on an industrial, large-scale task of ad latency prediction with nearly one billion examples.
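The mapper/reducer split described in the abstract can be illustrated with a minimal sketch. It assumes the local learners use the standard AROW update (as the name AROW-MR suggests) with a diagonal covariance, and it combines local models by a precision-weighted average of their Gaussians. That combiner is a simple stand-in chosen for illustration, not the paper's optimal combination rule, which the abstract does not specify. The function names and the regularization parameter `r` are hypothetical.

```python
def train_local_arow(examples, dim, r=1.0):
    """Mapper-side step (sketch): fit a diagonal-covariance AROW model on one shard.

    examples: iterable of (x, y) with x a list of floats and y in {-1, +1}.
    Returns (mu, sigma): per-feature mean and variance of the weight distribution.
    """
    mu = [0.0] * dim       # mean of the Gaussian over weight vectors
    sigma = [1.0] * dim    # diagonal of the covariance (per-feature variance)
    for x, y in examples:
        margin = y * sum(m * xi for m, xi in zip(mu, x))
        if margin >= 1.0:
            continue       # no hinge loss, so no update
        v = sum(s * xi * xi for s, xi in zip(sigma, x))   # confidence term x^T Sigma x
        beta = 1.0 / (v + r)
        alpha = (1.0 - margin) * beta
        # Both updates use the covariance from before this step.
        mu = [m + alpha * y * s * xi for m, s, xi in zip(mu, sigma, x)]
        sigma = [s - beta * s * s * xi * xi for s, xi in zip(sigma, x)]
    return mu, sigma


def combine_local_models(models):
    """Reducer-side step (assumed stand-in): precision-weighted average of local Gaussians."""
    dim = len(models[0][0])
    precision = [sum(1.0 / sigma[i] for _, sigma in models) for i in range(dim)]
    mean = [sum(mu[i] / sigma[i] for mu, sigma in models) / precision[i]
            for i in range(dim)]
    variance = [1.0 / p for p in precision]
    return mean, variance


def predict(mu, x):
    """Classify with the mean weight vector; returns +1 or -1."""
    return 1 if sum(m * xi for m, xi in zip(mu, x)) >= 0 else -1
```

In this sketch each mapper would call `train_local_arow` on its shard and emit `(mu, sigma)`, and a single reducer would call `combine_local_models` on the collected pairs. Features with low local variance (high confidence) dominate the combined mean, which is the intuition behind weighting by confidence rather than averaging the local hyperplanes uniformly.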
Keywords :
Gaussian distribution; computational complexity; constraint handling; distributed processing; pattern classification; support vector machines; AROW-MR algorithm; MapReduce; ad latency prediction; classification algorithms; computational advertising; confidence-weighted classifiers; crowd-sourcing; data complexity; data mining communities; data rates; data size; distributed confidence-weighted classification; high-dimensional data examples; high-throughput technologies; linear CW models; linear SVM solver; machine learning; memory constraints; parameter vectors; remote sensing; social networks; standard desktop computers; time constraints; Accuracy; Computational modeling; Data models; Equations; Prediction algorithms; Training; confidence-weighted classification
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Big Data, 2013 IEEE International Conference on
Conference_Location :
Silicon Valley, CA
Type :
conf
DOI :
10.1109/BigData.2013.6691607
Filename :
6691607