DocumentCode
1104218
Title
High-Dimensional Unsupervised Selection and Estimation of a Finite Generalized Dirichlet Mixture Model Based on Minimum Message Length
Author
Bouguila, Nizar ; Ziou, Djemel
Author_Institution
Concordia Univ., Montreal
Volume
29
Issue
10
fYear
2007
Firstpage
1716
Lastpage
1731
Abstract
We consider the problem of determining the structure of high-dimensional data without prior knowledge of the number of clusters. Data are represented by a finite mixture model based on the generalized Dirichlet distribution. The generalized Dirichlet distribution has a more general covariance structure than the Dirichlet distribution and offers high flexibility and ease of use for the approximation of both symmetric and asymmetric distributions. This makes the generalized Dirichlet distribution more practical and useful. An important problem in mixture modeling is the determination of the number of clusters. Indeed, a mixture with too many or too few components may not be appropriate to approximate the true model. Here, we consider the application of the minimum message length (MML) principle to determine the number of clusters. The MML is derived so as to choose the number of clusters in the mixture model that best describes the data. A comparison with other selection criteria is performed. The validation involves synthetic data, real data clustering, and two interesting real applications: classification of Web pages, and texture database summarization for efficient retrieval.
Keywords
data handling; data structures; unsupervised learning; Web page classification; asymmetric distribution approximation; data representation; finite generalized Dirichlet mixture model estimation; general covariance structure; generalized Dirichlet distribution; high-dimensional data; high-dimensional unsupervised selection; minimum message length; mixture modeling; real data clustering; synthetic data; texture database summarization; Bayesian methods; Computer vision; Image databases; Information retrieval; Information theory; Pattern recognition; Stochastic processes; Unsupervised learning; Web mining; Web pages; AIC; EM; Finite mixture models; LEC; MDL; MML; data clustering; generalized Dirichlet mixture; image database summarization; information theory; webmining; Algorithms; Artificial Intelligence; Cluster Analysis; Computer Simulation; Data Interpretation, Statistical; Information Storage and Retrieval; Models, Statistical; Natural Language Processing; Pattern Recognition, Automated
fLanguage
English
Journal_Title
Pattern Analysis and Machine Intelligence, IEEE Transactions on
Publisher
IEEE
ISSN
0162-8828
Type
jour
DOI
10.1109/TPAMI.2007.1095
Filename
4293203