DocumentCode :
1791575
Title :
Sparse computation for large-scale data mining
Author :
Hochbaum, Dorit S. ; Baumann, Philipp
Author_Institution :
Univ. of California, Berkeley, Berkeley, CA, USA
fYear :
2014
fDate :
27-30 Oct. 2014
Firstpage :
354
Lastpage :
363
Abstract :
Several leading data mining and clustering algorithms rely on inputs in the form of pairwise similarities. Yet, since the number of potential pairwise similarities grows quadratically in the size of the data set, it is computationally prohibitive to apply such algorithms to large data sets. This paper addresses this challenge with a novel method of sparse computation that computes only the relevant similarities instead of the complete similarity matrix. The method employs an efficient algorithm that provides an “approximate Principal Component Analysis”. In the resulting low-dimensional space, the concept of grid neighborhoods is applied to identify groups of objects with potentially high similarity. Unlike known sparsification approaches, which first generate the full set of pairwise similarities and thus take at least quadratic time, the sparse computation method generates only the relevant similarities. Sparse computation can be utilized in any data mining or clustering algorithm that requires pairwise similarities, such as the k-nearest neighbors algorithm or the spectral method. This approach differs from grid-based clustering algorithms in that grid-neighborhood proximity is used only to determine the entries in the sparse similarity matrix, not to identify the clusters. Indeed, objects can belong to the same grid neighborhood yet end up in different clusters or, conversely, belong to different neighborhoods yet be clustered together. The applicability of sparse computation to binary classification is demonstrated here for the recently devised supervised normalized cut (SNC). Our empirical results show that the approach significantly reduces the density of the similarity matrix, and with it the running time, while having a minimal effect (and often none) on accuracy compared to inputs using a complete similarity matrix.
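The pipeline described in the abstract (project to a low-dimensional space, overlay a grid, compute similarities only between objects in the same or adjacent grid blocks) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: exact PCA via SVD stands in for the paper's approximate PCA, and the function name `sparse_similarities`, the parameters `p`, `k` and `gamma`, and the Gaussian similarity are all hypothetical choices made for the example.

```python
from collections import defaultdict
from itertools import product

import numpy as np


def sparse_similarities(X, p=2, k=4, gamma=1.0):
    """Return {(i, j): similarity} for pairs i < j in the same or adjacent grid blocks.

    Assumptions (not from the paper): exact PCA replaces approximate PCA,
    and similarity is the Gaussian kernel exp(-gamma * squared distance).
    """
    X = np.asarray(X, dtype=float)

    # Project the centered data onto its leading p principal directions.
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    Z = Xc @ Vt[:p].T

    # Scale each projected coordinate to [0, 1] and cut it into k intervals.
    lo = Z.min(axis=0)
    span = Z.max(axis=0) - lo
    span[span == 0] = 1.0  # avoid division by zero on constant coordinates
    cells = np.minimum((k * (Z - lo) / span).astype(int), k - 1)

    # Group object indices by the grid block they fall into.
    blocks = defaultdict(list)
    for i, c in enumerate(cells):
        blocks[tuple(int(v) for v in c)].append(i)

    # Compare each object only with objects in its own or an adjacent block.
    offsets = list(product((-1, 0, 1), repeat=Z.shape[1]))
    sims = {}
    for block, members in blocks.items():
        for off in offsets:
            neighbor = tuple(b + o for b, o in zip(block, off))
            for i in members:
                for j in blocks.get(neighbor, ()):
                    if i < j:
                        d2 = float(np.sum((X[i] - X[j]) ** 2))
                        sims[(i, j)] = float(np.exp(-gamma * d2))
    return sims
```

For example, with four points forming two distant pairs, `sparse_similarities([[0, 0], [0.1, 0], [10, 0], [10.1, 0]], p=1, k=4)` yields entries only for the pairs `(0, 1)` and `(2, 3)`; the four cross-pair similarities are never computed. The resulting dictionary could then feed any method that consumes a sparse similarity matrix.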
Keywords :
data mining; grid computing; pattern classification; pattern clustering; sparse matrices; SNC; approximate principal component analysis; binary classification; grid neighborhoods proximity; grid-based clustering algorithms; k-nearest neighbors algorithm; large-scale data mining; pairwise similarities; sparse computation; sparse similarity matrix; sparsification; spectral method; supervised normalized cut; Accuracy; Approximation algorithms; Approximation methods; Data mining; Principal component analysis; Sparse matrices; Vectors;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Big Data (Big Data), 2014 IEEE International Conference on
Conference_Location :
Washington, DC
Type :
conf
DOI :
10.1109/BigData.2014.7004252
Filename :
7004252