Title :
On the clustering of large-scale data: A matrix-based approach
Author :
Wang, Lijun ; Dong, Ming
Author_Institution :
Dept. of Comput. Sci., Wayne State Univ., Detroit, MI, USA
Date :
July 31 2011-Aug. 5 2011
Abstract :
The analysis of large collections of digital documents has become a hot research topic as libraries and databases, such as PUBMED and IEEE publications, are converted to electronic form. The ubiquitous phenomenon of massive data and sparse information imposes considerable challenges in data mining research. In this paper, we propose a theoretical framework, Exemplar-based Low-rank sparse Matrix Decomposition (ELMD), to cluster large-scale datasets. Specifically, given a data matrix, ELMD first computes a representative data subspace and a near-optimal low-rank approximation. Then, the cluster centroids and indicators are obtained through matrix decomposition, in which we require that the cluster centroids lie within the representative data subspace. From a theoretical perspective, we show the correctness and convergence of the ELMD algorithm and provide a detailed analysis of its efficiency. Through extensive experiments on both synthetic and real datasets, we demonstrate the superior performance of ELMD for clustering large-scale data.
Keywords :
approximation theory; data mining; data structures; document handling; matrix decomposition; pattern clustering; set theory; ubiquitous computing; ELMD algorithm; IEEE publication; PUBMED publication; cluster centroids; data matrix; data mining; data subspace; digital database; digital document; digital library; exemplar-based low rank sparse matrix decomposition; large scale data set clustering; matrix-based approach; near optimal low rank approximation; real dataset; sparse information; synthetic dataset; ubiquitous phenomenon; Accuracy; Approximation algorithms; Approximation methods; Clustering algorithms; Matrix decomposition; Noise; Sparse matrices;
Conference_Title :
Neural Networks (IJCNN), The 2011 International Joint Conference on
Conference_Location :
San Jose, CA
Print_ISBN :
978-1-4244-9635-8
DOI :
10.1109/IJCNN.2011.6033212