Title :
High-Dimensional Unsupervised Selection and Estimation of a Finite Generalized Dirichlet Mixture Model Based on Minimum Message Length
Author :
Bouguila, Nizar ; Ziou, Djemel
Author_Institution :
Concordia Univ., Montreal
Abstract :
We consider the problem of determining the structure of high-dimensional data without prior knowledge of the number of clusters. Data are represented by a finite mixture model based on the generalized Dirichlet distribution. The generalized Dirichlet distribution has a more general covariance structure than the Dirichlet distribution and offers high flexibility and ease of use for the approximation of both symmetric and asymmetric distributions. This makes the generalized Dirichlet distribution more practical and useful. An important problem in mixture modeling is the determination of the number of clusters. Indeed, a mixture with too many or too few components may not be appropriate to approximate the true model. Here, we consider the application of the minimum message length (MML) principle to determine the number of clusters. The MML is derived so as to choose the number of clusters in the mixture model that best describes the data. A comparison with other selection criteria is performed. The validation involves synthetic data, real data clustering, and two interesting real applications: classification of Web pages, and texture database summarization for efficient retrieval.
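As a rough illustration of the model the abstract describes, the sketch below evaluates the log-density of a generalized Dirichlet distribution and of a finite mixture of such components. It assumes the standard Connor-Mosimann parameterization with shape vectors alpha and beta, which the abstract does not spell out; the function names `gd_logpdf` and `mixture_logpdf` are hypothetical and are not taken from the paper. The MML criterion itself is not reproduced here, since its exact form is not given in this record.

```python
import math


def gd_logpdf(x, alpha, beta):
    """Log-density of a generalized Dirichlet distribution (assumed
    Connor-Mosimann form) at x = (x_1, ..., x_d), with x_i > 0 and sum(x) < 1."""
    d = len(x)
    if not (len(alpha) == d and len(beta) == d):
        raise ValueError("x, alpha and beta must have the same length")
    if any(v <= 0.0 for v in x) or sum(x) >= 1.0:
        return float("-inf")  # outside the support

    logp = 0.0
    cumulative = 0.0
    for i in range(d):
        cumulative += x[i]
        # gamma_i = beta_i - alpha_{i+1} - beta_{i+1} for i < d, and beta_d - 1 for i = d
        if i < d - 1:
            gamma_i = beta[i] - alpha[i + 1] - beta[i + 1]
        else:
            gamma_i = beta[i] - 1.0
        # Normalizing constant of the i-th factor
        logp += (math.lgamma(alpha[i] + beta[i])
                 - math.lgamma(alpha[i]) - math.lgamma(beta[i]))
        # Kernel of the i-th factor
        logp += (alpha[i] - 1.0) * math.log(x[i]) + gamma_i * math.log(1.0 - cumulative)
    return logp


def mixture_logpdf(x, weights, components):
    """Log-density of a finite mixture; components is a list of (alpha, beta) pairs."""
    logs = [math.log(w) + gd_logpdf(x, a, b)
            for w, (a, b) in zip(weights, components)]
    m = max(logs)
    if m == float("-inf"):
        return m
    # Log-sum-exp for numerical stability
    return m + math.log(sum(math.exp(v - m) for v in logs))
```

Under this parameterization, choosing beta_i = alpha_{i+1} + beta_{i+1} for every i < d makes each gamma_i vanish except the last, so the density reduces to an ordinary Dirichlet. That is one way to see the more general covariance structure mentioned in the abstract.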
Keywords :
data handling; data structures; unsupervised learning; Web page classification; asymmetric distribution approximation; data representation; finite generalized Dirichlet mixture model estimation; general covariance structure; generalized Dirichlet distribution; high-dimensional data; high-dimensional unsupervised selection; minimum message length; mixture modeling; real data clustering; synthetic data; texture database summarization; Bayesian methods; Computer vision; Image databases; Information retrieval; Information theory; Pattern recognition; Stochastic processes; Unsupervised learning; Web mining; Web pages; AIC; EM; Finite mixture models; LEC; MDL; MML; data clustering; generalized Dirichlet mixture; image database summarization; information theory; webmining; Algorithms; Artificial Intelligence; Cluster Analysis; Computer Simulation; Data Interpretation, Statistical; Information Storage and Retrieval; Models, Statistical; Natural Language Processing; Pattern Recognition, Automated;
Journal_Title :
Pattern Analysis and Machine Intelligence, IEEE Transactions on
DOI :
10.1109/TPAMI.2007.1095