DocumentCode :
1442361
Title :
Understanding Errors in Approximate Distributed Latent Dirichlet Allocation
Author :
Ihler, Alexander ; Newman, David
Author_Institution :
Dept. of Comput. Sci., Univ. of California at Irvine, Irvine, CA, USA
Volume :
24
Issue :
5
fYear :
2012
fDate :
5/1/2012
Firstpage :
952
Lastpage :
960
Abstract :
Latent Dirichlet allocation (LDA) is a popular algorithm for discovering semantic structure in large collections of text or other data. Although its complexity is linear in the data size, its use on increasingly massive collections has created considerable interest in parallel implementations. “Approximate distributed” LDA, or AD-LDA, approximates the popular collapsed Gibbs sampling algorithm for LDA models while running on a distributed architecture. Although this algorithm often appears to perform well in practice, its quality is not well understood theoretically or easily assessed on new data. In this work, we theoretically justify the approximation, and modify AD-LDA to track an error bound on performance. Specifically, we upper bound the probability of making a sampling error at each step of the algorithm (compared to an exact, sequential Gibbs sampler), given the samples drawn thus far. We show empirically that our bound is sufficiently tight to give a meaningful and intuitive measure of approximation error in AD-LDA, allowing the user to track the tradeoff between accuracy and efficiency while executing in parallel.
Keywords :
approximation theory; sampling methods; text analysis; Gibbs sampling algorithm; approximate distributed latent Dirichlet allocation; approximation error; distributed architecture; error understanding; linear complexity; sampling error; semantic structure discovery; text data collection; Approximation algorithms; Approximation error; Computational modeling; Measurement uncertainty; Partitioning algorithms; Data mining; error analysis; parallel processing; topic model;
fLanguage :
English
Journal_Title :
Knowledge and Data Engineering, IEEE Transactions on
Publisher :
IEEE
ISSN :
1041-4347
Type :
jour
DOI :
10.1109/TKDE.2011.29
Filename :
5708149