• DocumentCode
    2984761
  • Title
    High Performance Offline and Online Distributed Collaborative Filtering
  • Author
    Narang, Arun ; Srivastava, Anurag ; Katta, N.P.K.
  • Author_Institution
    IBM India Res. Lab., New Delhi, India
  • fYear
    2012
  • fDate
    10-13 Dec. 2012
  • Firstpage
    549
  • Lastpage
    558
  • Abstract
    Big data analytics is an active research area in both academia and industry. It envisages processing massive amounts of data at high rates to generate new insights that benefit both users and providers in industries such as e-commerce, telecom, finance, and life sciences. We consider collaborative filtering (CF) and clustering algorithms, which are fundamental analytics kernels that help achieve these aims. High-throughput CF and co-clustering on highly sparse and massive datasets, combined with high prediction accuracy, is a computationally challenging problem. In this paper, we present a novel hierarchical design for a soft real-time (less than 1 minute) distributed co-clustering-based collaborative filtering algorithm. We study both the online and offline variants of this algorithm. Theoretical analysis of the time complexity of our algorithm proves the efficacy of our approach. Further, we present the impact of load-balancing-based optimizations on multi-core cluster architectures. Using the Netflix dataset (900M training ratings with replication) as well as the Yahoo KDD Cup dataset (2.3B training ratings with replication), we demonstrate the performance and scalability of our algorithm on a large multi-core cluster architecture. In offline mode, our distributed algorithm demonstrates around 4x better performance (on Blue Gene/P) than the best prior work, along with high accuracy. In online mode, we demonstrate around 3x better performance than a baseline MPI implementation. To the best of our knowledge, our algorithm provides the best known online and offline performance and scalability results with high accuracy on multi-core cluster architectures.
  • Keywords
    collaborative filtering; computational complexity; distributed algorithms; multiprocessing systems; resource allocation; Netflix dataset; analytics kernels; clustering algorithm; data analytics; distributed algorithm; high performance offline distributed collaborative filtering; load balancing based optimization; multicore cluster architecture; online distributed collaborative filtering algorithm; time complexity; Algorithm design and analysis; Approximation methods; Clustering algorithms; Collaboration; Matrix decomposition; Partitioning algorithms; Training; Distributed Collaborative Filtering; Parallel Performance Optimizations; Performance & Scalability Analysis;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Data Mining (ICDM), 2012 IEEE 12th International Conference on
  • Conference_Location
    Brussels
  • ISSN
    1550-4786
  • Print_ISBN
    978-1-4673-4649-8
  • Type
    conf
  • DOI
    10.1109/ICDM.2012.128
  • Filename
    6413871