• DocumentCode
    1016640
  • Title
    Clustering of Count Data Using Generalized Dirichlet Multinomial Distributions
  • Author
    Bouguila, Nizar
  • Author_Institution
    Concordia Univ., Montreal
  • Volume
    20
  • Issue
    4
  • fYear
    2008
  • fDate
    4/1/2008 12:00:00 AM
  • Firstpage
    462
  • Lastpage
    474
  • Abstract
    In this paper, we examine the problem of count data clustering. We analyze this problem using finite mixtures of distributions. The multinomial distribution and the multinomial Dirichlet distribution (MDD) are widely used to model count data. We show that these two distributions are not the best choice in every application, and we propose another model, the multinomial generalized Dirichlet distribution (MGDD), which is the composition of the generalized Dirichlet distribution and the multinomial, in the same way that the MDD is the composition of the Dirichlet and the multinomial. Parameter estimation and the determination of the number of components in our model are based on the deterministic annealing expectation-maximization (DAEM) approach and the minimum description length (MDL) criterion, respectively. We compare our method to standard approaches such as multinomial and multinomial Dirichlet mixtures to show its merits. The comparison involves different applications such as spatial color image database indexing, handwritten digit recognition, and text document clustering.
  • Keywords
    expectation-maximisation algorithm; pattern clustering; statistical distributions; count data clustering; deterministic annealing expectation-maximization approach; handwritten digit recognition; minimum description length criterion; multinomial generalized Dirichlet distribution; parameter estimation; spatial color image databases indexing; text document clustering; Feature extraction; Image databases; clustering
  • fLanguage
    English
  • Journal_Title
    Knowledge and Data Engineering, IEEE Transactions on
  • Publisher
    IEEE
  • ISSN
    1041-4347
  • Type
    jour
  • DOI
    10.1109/TKDE.2007.190726
  • Filename
    4407701