Title :
Unsupervised document clustering using multi-resolution latent semantic density analysis
Author :
Bellegarda, Jerome R.
Author_Institution :
Speech & Language Technol., Apple Inc., Cupertino, CA, USA
Date :
Aug. 29 - Sept. 1, 2010
Abstract :
To find meaningful groupings in a given document collection, it is essential to learn the right granularity for the domain, uncover core themes and attendant outliers, and derive suitable labels with which to characterize each of the resulting clusters. The outcome is therefore affected both by the choice of representation and by the behavior of the clustering algorithm. This paper advocates a strategy which combines density-based clustering with latent semantic feature extraction. Documents are first mapped into a latent semantic vector space, and then clustered in that space on the basis of a multi-resolution density measure. Empirical evidence gathered on several document collections suggests that this procedure is effective in identifying semantically sound document clusters.
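Illustrative sketch (not taken from the paper): the two-stage procedure described in the abstract, latent semantic mapping followed by density-based clustering at more than one resolution, might look roughly like the following. It rests on several assumptions that the record does not confirm: raw term counts as input, a plain truncated SVD computed with NumPy as the latent semantic mapping, and a DBSCAN-style neighborhood rule rerun at several radii as a stand-in for the multi-resolution density measure. The names lsa_embed, density_clusters, multi_resolution and DOCS, and the parameters k, radius and min_neighbors, are hypothetical.

```python
import numpy as np

# Toy corpus (made up for illustration): two themes plus one unrelated document.
DOCS = [
    "cat dog pet", "cat dog pet animal", "pet dog cat",
    "stock market trade", "stock market trade price",
    "trade stock market", "market trade stock",
    "quantum physics",
]

def lsa_embed(docs, k=2):
    """Map each document to a unit-length k-dimensional latent vector."""
    vocab = sorted({w for d in docs for w in d.lower().split()})
    index = {w: i for i, w in enumerate(vocab)}
    counts = np.zeros((len(docs), len(vocab)))
    for row, doc in enumerate(docs):
        for w in doc.lower().split():
            counts[row, index[w]] += 1.0
    # Assumption: unweighted counts; practical systems usually weight them first.
    u, s, _ = np.linalg.svd(counts, full_matrices=False)
    vecs = u[:, :k] * s[:k]
    norms = np.linalg.norm(vecs, axis=1, keepdims=True)
    # Documents with (numerically) no latent component stay at the origin.
    return vecs / np.where(norms < 1e-9, 1.0, norms)

def density_clusters(vecs, radius, min_neighbors=2):
    """DBSCAN-style grouping on cosine distance; label -1 marks outliers."""
    close = (1.0 - vecs @ vecs.T) <= radius
    dense = close.sum(axis=1) >= min_neighbors  # the count includes the point itself
    labels = np.full(len(vecs), -1)
    cluster = 0
    for seed in range(len(vecs)):
        if not dense[seed] or labels[seed] != -1:
            continue
        labels[seed] = cluster
        stack = [seed]
        while stack:
            p = stack.pop()
            if not dense[p]:
                continue  # border points join a cluster but do not expand it
            for q in np.flatnonzero(close[p]):
                if labels[q] == -1:
                    labels[q] = cluster
                    stack.append(q)
        cluster += 1
    return labels

def multi_resolution(vecs, radii):
    """Repeat the density clustering at several neighborhood radii."""
    return {r: density_clusters(vecs, r).tolist() for r in radii}

vectors = lsa_embed(DOCS, k=2)
for r, labels in multi_resolution(vectors, [0.1, 1.5]).items():
    print(r, labels)
```

On this toy corpus the small radius keeps the two themes apart and leaves the unrelated document unassigned, while the large radius merges everything into one group. A fuller treatment would need some way to choose among resolutions, which this sketch does not attempt.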
Keywords :
document handling; feature extraction; pattern clustering; density based clustering; document collection; latent semantic feature extraction; latent semantic vector space; multiresolution density measure; multiresolution latent semantic density analysis; unsupervised document clustering; Semantics; density measure; latent semantic mapping; structured document collection; unsupervised clustering; variable resolution
Conference_Title :
Machine Learning for Signal Processing (MLSP), 2010 IEEE International Workshop on
Conference_Location :
Kittilä, Finland
Print_ISBN :
978-1-4244-7875-0
ISSN :
1551-2541
DOI :
10.1109/MLSP.2010.5587982