Title :
Distributed hierarchical co-clustering and collaborative filtering algorithm
Author :
Narang, Arun ; Srivastava, Anurag ; Katta, N.P.K.
Author_Institution :
IBM India Res. Lab., New Delhi, India
Abstract :
Petascale analytics is an active research area in both academia and industry. It envisages processing massive amounts of data at extremely high rates to generate new scientific insights and to benefit both users and providers in industries such as e-commerce, telecom, finance, and life sciences. We consider collaborative filtering (CF) and clustering algorithms, which are fundamental analytics kernels that help achieve these aims. Performing CF and co-clustering in real time on massive, highly sparse datasets while achieving high prediction accuracy is a computationally challenging problem. In this paper, we present a novel hierarchical design for a soft real-time (less than 1 minute) distributed co-clustering-based collaborative filtering algorithm. Our distributed algorithm has been optimized for multi-core cluster architectures. A theoretical analysis of the algorithm's time complexity demonstrates the efficacy of our approach. Using the Netflix dataset (900M training ratings with replication) and the Yahoo KDD Cup 1 dataset (4.6B training ratings with replication), we demonstrate the performance and scalability of our algorithm on a 4096-node multi-core cluster architecture. Our distributed algorithm (implemented using OpenMP with MPI) delivers around 4x better performance (on Blue Gene/P) than the best prior work, along with high accuracy (RMSE of 26 ± 4 for the Yahoo KDD Cup data and 0.87 ± 0.02 for the Netflix data). To the best of our knowledge, these are the best known performance results for collaborative filtering, at high prediction accuracy, on multi-core cluster architectures.
Keywords :
collaborative filtering; computational complexity; distributed algorithms; multiprocessing systems; pattern clustering; Blue Gene/P; MPI; Netflix dataset; OpenMP; RMSE; Yahoo KDD Cup dataset; analytics kernels; collaborative filtering algorithm; distributed hierarchical co-clustering algorithm; multicore cluster architecture; petascale analytics; real-time CF algorithms; soft real-time distributed co-clustering; time complexity;
Conference_Title :
High Performance Computing (HiPC), 2012 19th International Conference on
Conference_Location :
Pune
Print_ISBN :
978-1-4673-2372-7
Electronic_ISBN :
978-1-4673-2370-3
DOI :
10.1109/HiPC.2012.6507497