Title :
Document clustering based on diffusion maps and a comparison of the k-means performances in various spaces
Author :
Allah, Fadoua Ataa ; Grosky, William I. ; Aboutajdine, Driss
Author_Institution :
GSCM-LRIT Lab., Mohamed V-Agdal Univ., Rabat
Abstract :
A major challenge in text mining arises from increasingly large text datasets and the high dimensionality associated with natural language. In this research, a systematic study of document clustering is conducted, using the recently introduced diffusion framework and some characteristics of the singular value decomposition. The study is threefold. First, we propose constructing a diffusion kernel based on the cosine distance. Second, we compare the performance of the k-means algorithm in four different vector spaces: Salton space, latent semantic analysis (LSA) space, diffusion space based on the cosine distance, and diffusion space based on the Euclidean distance. Third, we undertake a statistical study of the k-means algorithm in the LSA space and in the diffusion space based on the cosine distance. In most of our experiments, k-means in the diffusion space based on the cosine distance performs better. In addition, the running time in this space is negligible compared to the time needed for k-means in Salton space.
Keywords :
natural language processing; pattern clustering; singular value decomposition; text analysis; Euclidean distance; Salton space; cosine distance; diffusion maps; document clustering; k-means performances; natural language; singular value decomposition; text datasets; text mining; vector spaces; Algorithm design and analysis; Clustering algorithms; Computational Intelligence Society; Functional analysis; Kernel; Laboratories; Natural languages; Performance analysis; Singular value decomposition; Text mining
Conference_Title :
Computers and Communications, 2008. ISCC 2008. IEEE Symposium on
Conference_Location :
Marrakech
Print_ISBN :
978-1-4244-2702-4
Electronic_ISBN :
1530-1346
DOI :
10.1109/ISCC.2008.4625693