DocumentCode :
55747
Title :
Latent IBP Compound Dirichlet Allocation
Author :
Archambeau, Cedric ; Lakshminarayanan, Balaji ; Bouchard, Guillaume
Author_Institution :
Amazon Berlin, Berlin, Germany
Volume :
37
Issue :
2
fYear :
2015
fDate :
Feb. 2015
Firstpage :
321
Lastpage :
333
Abstract :
We introduce the four-parameter IBP compound Dirichlet process (ICDP), a stochastic process that generates sparse non-negative vectors with a potentially unbounded number of entries. By sampling repeatedly from the ICDP, we can generate sparse matrices with an infinite number of columns and power-law characteristics. We apply the four-parameter ICDP to sparse nonparametric topic modelling to account for the very large number of topics present in large text corpora and the power-law distribution of the vocabulary of natural languages. The model, which we call latent IBP compound Dirichlet allocation (LIDA), allows for power-law distributions both in the number of topics summarising the documents and in the number of words defining each topic. It can be interpreted as a sparse variant of the hierarchical Pitman-Yor process when applied to topic modelling. We derive an efficient and simple collapsed Gibbs sampler closely related to the collapsed Gibbs sampler of latent Dirichlet allocation (LDA), making the model applicable in a wide range of domains. Our nonparametric Bayesian topic model compares favourably to the widely used hierarchical Dirichlet process and its heavy-tailed version, the hierarchical Pitman-Yor process, on benchmark corpora. Experiments demonstrate that accounting for the power-law distribution of real data is beneficial and that sparsity provides more interpretable results.
Keywords :
Analytical models; Atomic measurements; Bayes methods; Compounds; Data models; Resource management; Vocabulary; Bayesian nonparametrics; Gibbs sampling; bag-of-words representation; clustering; power-law distribution; sparse modelling; topic modelling
fLanguage :
English
Journal_Title :
Pattern Analysis and Machine Intelligence, IEEE Transactions on
Publisher :
IEEE
ISSN :
0162-8828
Type :
jour
DOI :
10.1109/TPAMI.2014.2313122
Filename :
6780626