DocumentCode
2350600
Title
Unsupervised document clustering using multi-resolution latent semantic density analysis
Author
Bellegarda, Jerome R.
Author_Institution
Speech & Language Technol., Apple Inc., Cupertino, CA, USA
fYear
2010
fDate
Aug. 29 2010-Sept. 1 2010
Firstpage
361
Lastpage
366
Abstract
To find meaningful groupings in a given document collection, it is essential to learn the right granularity for the domain, uncover core themes and attendant outliers, and derive suitable labels with which to characterize each of the resulting clusters. The outcome is therefore affected both by the choice of representation and by the behavior of the clustering algorithm. This paper advocates a strategy which combines density-based clustering with latent semantic feature extraction. Documents are first mapped into a latent semantic vector space, and then clustered in that space on the basis of a multi-resolution density measure. Empirical evidence gathered on several document collections suggests that this procedure is effective in identifying semantically sound document clusters.
Keywords
document handling; feature extraction; pattern clustering; density based clustering; document collection; latent semantic feature extraction; latent semantic vector space; multiresolution density measure; multiresolution latent semantic density analysis; unsupervised document clustering; semantics; density measure; latent semantic mapping; structured document collection; unsupervised clustering; variable resolution
fLanguage
English
Publisher
IEEE
Conference_Titel
Machine Learning for Signal Processing (MLSP), 2010 IEEE International Workshop on
Conference_Location
Kittila
ISSN
1551-2541
Print_ISBN
978-1-4244-7875-0
Electronic_ISBN
1551-2541
Type
conf
DOI
10.1109/MLSP.2010.5587982
Filename
5587982