DocumentCode
2984761
Title
High Performance Offline and Online Distributed Collaborative Filtering
Author
Narang, Arun ; Srivastava, Anurag ; Katta, N.P.K.
Author_Institution
IBM India Res. Lab., New Delhi, India
fYear
2012
fDate
10-13 Dec. 2012
Firstpage
549
Lastpage
558
Abstract
Big data analytics is a hot research area in both academia and industry. It envisages processing massive amounts of data at high rates to generate new insights, leading to a positive impact (for both users and providers) on industries such as E-commerce, Telecom, Finance, Life Sciences and so forth. We consider collaborative filtering (CF) and clustering algorithms, which are fundamental analytics kernels that help in achieving these aims. Achieving high-throughput CF and co-clustering on highly sparse and massive datasets, along with high prediction accuracy, is a computationally challenging problem. In this paper, we present a novel hierarchical design for a soft real-time (less than 1 minute) distributed co-clustering-based collaborative filtering algorithm. We study both the online and offline variants of this algorithm. Theoretical analysis of the time complexity of our algorithm proves the efficacy of our approach. Further, we present the impact of load-balancing-based optimizations on multi-core cluster architectures. Using the Netflix dataset (900M training ratings with replication) as well as the Yahoo KDD Cup dataset (2.3B training ratings with replication), we demonstrate the performance and scalability of our algorithm on a large multi-core cluster architecture. In offline mode, our distributed algorithm demonstrates around 4x better performance (on Blue Gene/P) compared to the best prior work, along with high accuracy. In online mode, we demonstrate around 3x better performance compared to a baseline MPI implementation. To the best of our knowledge, our algorithm provides the best known online and offline performance and scalability results with high accuracy on multi-core cluster architectures.
Keywords
collaborative filtering; computational complexity; distributed algorithms; multiprocessing systems; resource allocation; Netflix dataset; analytics kernels; clustering algorithm; data analytics; distributed algorithm; high performance offline distributed collaborative filtering; load balancing based optimization; multicore cluster architecture; online distributed collaborative filtering algorithm; time complexity; Algorithm design and analysis; Approximation methods; Clustering algorithms; Collaboration; Matrix decomposition; Partitioning algorithms; Training; Distributed Collaborative Filtering; Parallel Performance Optimizations; Performance & Scalability Analysis;
fLanguage
English
Publisher
IEEE
Conference_Titel
Data Mining (ICDM), 2012 IEEE 12th International Conference on
Conference_Location
Brussels
ISSN
1550-4786
Print_ISBN
978-1-4673-4649-8
Type
conf
DOI
10.1109/ICDM.2012.128
Filename
6413871