• DocumentCode
    3603465
  • Title
    Co-ClusterD: A Distributed Framework for Data Co-Clustering with Sequential Updates
  • Author
    Xiang Cheng ; Sen Su ; Lixin Gao ; Jiangtao Yin
  • Author_Institution
    State Key Lab. of Networking & Switching Technol., Beijing Univ. of Posts & Telecommun., Beijing, China
  • Volume
    27
  • Issue
    12
  • fYear
    2015
  • Firstpage
    3231
  • Lastpage
    3244
  • Abstract
    Co-clustering has emerged as a powerful data mining tool for two-dimensional co-occurrence and dyadic data. However, co-clustering algorithms often require significant computational resources and have been dismissed as impractical for large data sets. Existing studies have provided strong empirical evidence that expectation-maximization (EM) algorithms (e.g., the k-means algorithm) with sequential updates can significantly reduce the computational cost without degrading the resulting solution. Motivated by this observation, we introduce sequential updates for alternate minimization co-clustering (AMCC) algorithms, which are variants of EM algorithms, and show that AMCC algorithms with sequential updates converge. We then propose two approaches to parallelize AMCC algorithms with sequential updates in a distributed environment. Both approaches are proved to maintain the convergence properties of AMCC algorithms. Based on these two approaches, we present a new distributed framework, Co-ClusterD, which supports efficient implementations of AMCC algorithms with sequential updates. We design and implement Co-ClusterD, and show its efficiency through two AMCC algorithms: fast nonnegative matrix tri-factorization (FNMTF) and information theoretic co-clustering (ITCC). We evaluate our framework on both a local cluster of machines and the Amazon EC2 cloud. Empirical results show that AMCC algorithms implemented in Co-ClusterD can achieve much faster convergence and often obtain better results than their traditional concurrent counterparts.
  • Keywords
    cloud computing; cost reduction; data mining; expectation-maximisation algorithm; matrix decomposition; minimisation; pattern clustering; sequential estimation; 2D co-occurrence; AMCC algorithm; Amazon EC2 cloud; Co-ClusterD; EM algorithm; FNMTF; ITCC; alternate minimization co-clustering; computational cost reduction; data co-clustering; data mining tool; distributed framework; dyadic data; expectation-maximization algorithm; fast nonnegative matrix tri-factorization; information theoretic co-clustering; sequential update; Algorithm design and analysis; Approximation algorithms; Clustering algorithms; Convergence; Linear programming; Minimization; Prototypes; Cloud Computing; Co-Clustering; Concurrent Updates; Distributed Framework; Sequential Updates
  • fLanguage
    English
  • Journal_Title
    Knowledge and Data Engineering, IEEE Transactions on
  • Publisher
    IEEE
  • ISSN
    1041-4347
  • Type
    jour
  • DOI
    10.1109/TKDE.2015.2451634
  • Filename
    7145441