DocumentCode :
2776190
Title :
Probabilistic Metrics for Soft-Clustering and Topic Model Validation
Author :
Ramirez, Eduardo H. ; Brena, Ramon ; Magatti, Davide ; Stella, Fabio
Author_Institution :
Center for Intell. Syst., Tecnol. de Monterrey, Monterrey, Mexico
Volume :
1
fYear :
2010
fDate :
Aug. 31 2010-Sept. 3 2010
Firstpage :
406
Lastpage :
412
Abstract :
In this paper the problem of performing external validation of the semantic coherence of topic models is considered. The Fowlkes-Mallows index, a known clustering validation metric, is generalized for the case of overlapping partitions and multi-labeled collections, thus making it suitable for validating topic modeling algorithms. In addition, we propose new probabilistic metrics inspired by the concepts of recall and precision. The proposed metrics also have clear probabilistic interpretations and can be applied to validate and compare other soft and overlapping clustering algorithms. The approach is exemplified by using the Reuters-21578 multi-labeled collection to validate LDA models, then using Monte Carlo simulations to show the convergence to the predicted results. Additional statistical evidence is provided to better understand the relation of the metrics presented.
Keywords :
Monte Carlo methods; pattern clustering; probability; text analysis; Fowlkes-Mallows index; LDA models; Monte Carlo simulations; Reuters-21578 multilabeled collection; clustering validation metric; overlapping clustering algorithms; probabilistic interpretations; probabilistic metrics; soft-clustering algorithms; topic model validation; cluster validation metrics; external validation; soft-clustering; topic modeling;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Web Intelligence and Intelligent Agent Technology (WI-IAT), 2010 IEEE/WIC/ACM International Conference on
Conference_Location :
Toronto, ON
Print_ISBN :
978-1-4244-8482-9
Electronic_ISBN :
978-0-7695-4191-4
Type :
conf
DOI :
10.1109/WI-IAT.2010.148
Filename :
5616623