DocumentCode :
2350600
Title :
Unsupervised document clustering using multi-resolution latent semantic density analysis
Author :
Bellegarda, Jerome R.
Author_Institution :
Speech & Language Technol., Apple Inc., Cupertino, CA, USA
fYear :
2010
fDate :
Aug. 29 2010-Sept. 1 2010
Firstpage :
361
Lastpage :
366
Abstract :
To find meaningful groupings in a given document collection, it is essential to learn the right granularity for the domain, uncover core themes and attendant outliers, and derive suitable labels with which to characterize each of the resulting clusters. The outcome is therefore affected both by the choice of representation and by the behavior of the clustering algorithm. This paper advocates a strategy which combines density-based clustering with latent semantic feature extraction. Documents are first mapped into a latent semantic vector space, and then clustered in that space on the basis of a multi-resolution density measure. Empirical evidence gathered on several document collections suggests that this procedure is effective in identifying semantically sound document clusters.
Keywords :
document handling; feature extraction; pattern clustering; density based clustering; document collection; latent semantic feature extraction; latent semantic vector space; multiresolution density measure; multiresolution latent semantic density analysis; unsupervised document clustering; Semantics; density measure; latent semantic mapping; structured document collection; unsupervised clustering; variable resolution
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Machine Learning for Signal Processing (MLSP), 2010 IEEE International Workshop on
Conference_Location :
Kittila
ISSN :
1551-2541
Print_ISBN :
978-1-4244-7875-0
Electronic_ISBN :
1551-2541
Type :
conf
DOI :
10.1109/MLSP.2010.5587982
Filename :
5587982