• DocumentCode
    2370295
  • Title
    Information theoretic clustering of sparse cooccurrence data
  • Author
    Dhillon, Inderjit S.; Guan, Yuqiang
  • Author_Institution
    Dept. of Comput. Sci., Texas Univ., Austin, TX, USA
  • fYear
    2003
  • fDate
    19-22 Nov. 2003
  • Firstpage
    517
  • Lastpage
    520
  • Abstract
    A novel approach to clustering cooccurrence data poses it as an optimization problem in information theory that minimizes the resulting loss in mutual information. A divisive clustering algorithm that monotonically reduces this loss function was recently proposed. We show that sparse high-dimensional data presents special challenges, which can result in the algorithm getting stuck at poor local minima. We propose two solutions to this problem: (a) a "prior" to overcome infinite relative entropy values, as in the supervised Naive Bayes algorithm, and (b) local search to escape local minima. Finally, we combine these solutions to get a robust algorithm that is computationally efficient. We present experimental results to show that the proposed method is effective in clustering document collections and outperforms previous information-theoretic clustering approaches.
  • Keywords
    Bayes methods; information theory; learning (artificial intelligence); optimisation; pattern clustering; divisive clustering algorithm; document clustering; local minima; sparse high-dimensional cooccurrence data; supervised Naive Bayes algorithm; Character generation; Clustering algorithms; Entropy; Information theory; Loss measurement; Mutual information; Probability distribution; Random variables; Robustness; Unsupervised learning
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Data Mining, 2003. ICDM 2003. Third IEEE International Conference on
  • Print_ISBN
    0-7695-1978-4
  • Type
    conf
  • DOI
    10.1109/ICDM.2003.1250966
  • Filename
    1250966
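
  • Note
    The abstract mentions that sparse data can produce infinite relative entropy values and that a "prior" overcomes this. The sketch below only illustrates that general effect; it is not taken from the paper. The helper names (kl_divergence, smooth_with_uniform_prior), the uniform form of the prior, the weight alpha, and the toy distributions are all assumptions made for illustration.

```python
import math


def kl_divergence(p, q):
    """Relative entropy KL(p || q) in bits.

    Returns math.inf when q assigns zero probability to an outcome
    that p considers possible, which is common with sparse data.
    """
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue  # 0 * log(0 / q) is taken as 0
        if qi == 0.0:
            return math.inf
        total += pi * math.log2(pi / qi)
    return total


def smooth_with_uniform_prior(q, alpha):
    """Mix q with a uniform distribution (assumed form of the prior).

    The result (q + alpha * uniform) / (1 + alpha) is strictly positive
    everywhere, so KL(p || result) is always finite.
    """
    n = len(q)
    return [(qi + alpha / n) / (1.0 + alpha) for qi in q]


# Toy word distributions over a 4-word vocabulary (hypothetical values).
document = [0.5, 0.5, 0.0, 0.0]
cluster = [0.0, 0.6, 0.4, 0.0]  # the cluster never saw word 0

raw = kl_divergence(document, cluster)
smoothed = kl_divergence(document, smooth_with_uniform_prior(cluster, 0.1))

print(raw)       # infinite: the document can never be assigned to this cluster
print(smoothed)  # finite once the prior is mixed in
```

    With an infinite divergence a document can never move into a cluster that lacks one of its words, which is one way a divisive procedure can stall; the smoothed value stays finite, so such moves remain comparable.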