DocumentCode
55747
Title
Latent IBP Compound Dirichlet Allocation
Author
Archambeau, Cedric ; Lakshminarayanan, Balaji ; Bouchard, Guillaume
Author_Institution
Amazon Berlin, Berlin, Germany
Volume
37
Issue
2
fYear
2015
fDate
Feb. 2015
Firstpage
321
Lastpage
333
Abstract
We introduce the four-parameter IBP compound Dirichlet process (ICDP), a stochastic process that generates sparse non-negative vectors with a potentially unbounded number of entries. Repeatedly sampling from the ICDP generates sparse matrices with an infinite number of columns and power-law characteristics. We apply the four-parameter ICDP to sparse nonparametric topic modelling to account for the very large number of topics present in large text corpora and the power-law distribution of the vocabulary of natural languages. The model, which we call latent IBP compound Dirichlet allocation (LIDA), allows for power-law distributions both in the number of topics summarising the documents and in the number of words defining each topic. It can be interpreted as a sparse variant of the hierarchical Pitman-Yor process when applied to topic modelling. We derive an efficient and simple collapsed Gibbs sampler closely related to the collapsed Gibbs sampler of latent Dirichlet allocation (LDA), making the model applicable in a wide range of domains. Our nonparametric Bayesian topic model compares favourably to the widely used hierarchical Dirichlet process and its heavy-tailed version, the hierarchical Pitman-Yor process, on benchmark corpora. Experiments demonstrate that accounting for the power-law distribution of real data is beneficial and that sparsity provides more interpretable results.
Keywords
Analytical models; Atomic measurements; Bayes methods; Compounds; Data models; Resource management; Vocabulary; Bayesian nonparametrics; Gibbs sampling; bag-of-words representation; clustering; power-law distribution; sparse modelling; topic modelling
fLanguage
English
Journal_Title
Pattern Analysis and Machine Intelligence, IEEE Transactions on
Publisher
IEEE
ISSN
0162-8828
Type
jour
DOI
10.1109/TPAMI.2014.2313122
Filename
6780626