DocumentCode :
1867625
Title :
An Information-Theoretic Approach for Unsupervised Topic Mining in Large Text Collections
Author :
Ramirez, Eduardo H. ; Brena, Ramon F.
Volume :
1
fYear :
2009
fDate :
15-18 Sept. 2009
Firstpage :
331
Lastpage :
334
Abstract :
In this paper we focus on the task of identifying topics in large text collections in a completely unsupervised way. In contrast to probabilistic topic modeling methods, which require first estimating the density of probability distributions, we model topics as subsets of terms that are used as queries to an index of documents. By retrieving the documents relevant to these topical queries, we obtain overlapping clusters of semantically similar documents. To find the topical queries, we generate candidate queries using signature-calculation heuristics such as those used in duplicate-detection methods, and then evaluate the candidates using an information-gain function defined as "semantic force". The method targets the semantic analysis of collections on the order of millions of documents, so it has been implemented in map-reduce style. We present some initial results to support the feasibility of the approach.
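The pipeline outlined in the abstract (generate candidate term subsets, retrieve the documents each one matches, score the candidates) can be sketched roughly as follows. This is an illustration only: the abstract does not define "semantic force" or the signature heuristics, so the score used here (a KL divergence between the term distribution of the retrieved documents and that of the whole collection) and the candidate generator (pairs drawn from each document's rarest terms) are assumptions, as are all function names and the toy data.

```python
import math
from collections import Counter, defaultdict
from itertools import combinations


def build_index(docs):
    """Inverted index: term -> set of ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, tokens in enumerate(docs):
        for term in tokens:
            index[term].add(doc_id)
    return index


def candidate_queries(docs, index, k=3):
    """Assumed heuristic: pairs drawn from each document's k rarest terms."""
    candidates = set()
    for tokens in docs:
        rare = sorted(set(tokens), key=lambda t: (len(index[t]), t))[:k]
        candidates.update(frozenset(pair) for pair in combinations(rare, 2))
    return candidates


def retrieve(query, index):
    """Conjunctive retrieval: documents that contain every query term."""
    return set.intersection(*(index[t] for t in query))


def gain(query, docs, index):
    """Assumed stand-in score: KL divergence (in bits) of the retrieved
    cluster's term distribution from the collection's term distribution."""
    hits = retrieve(query, index)
    if len(hits) < 2:  # a single document is not treated as a cluster
        return 0.0
    background = Counter(t for d in docs for t in d)
    cluster = Counter(t for i in hits for t in docs[i])
    n_bg, n_cl = sum(background.values()), sum(cluster.values())
    return sum(
        (c / n_cl) * math.log2((c / n_cl) / (background[t] / n_bg))
        for t, c in cluster.items()
    )


# Toy collection (hypothetical data, not from the paper).
docs = [
    "jaguar speed cat jungle".split(),
    "jaguar cat prey jungle".split(),
    "jaguar car engine speed".split(),
    "car engine fuel price".split(),
]
index = build_index(docs)
ranked = sorted(
    candidate_queries(docs, index),
    key=lambda q: gain(q, docs, index),
    reverse=True,
)
```

Because every candidate is scored independently of the others, the scoring step maps naturally onto a map phase over candidates, which is consistent with the map-reduce style mentioned in the abstract.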
Keywords :
Conferences; Data mining; Intelligent agent; Intelligent systems; Large scale integration; Linear discriminant analysis; Parameter estimation; Performance analysis; Probability distribution; Scalability; data mining; topic modeling; unsupervised learning;
fLanguage :
English
Publisher :
iet
Conference_Titel :
Web Intelligence and Intelligent Agent Technologies, 2009. WI-IAT '09. IEEE/WIC/ACM International Joint Conferences on
Conference_Location :
Milan, Italy
Print_ISBN :
978-0-7695-3801-3
Electronic_ISBN :
978-1-4244-5331-3
Type :
conf
DOI :
10.1109/WI-IAT.2009.58
Filename :
5286050