DocumentCode
3248828
Title
Iterative clustering of high dimensional text data augmented by local search
Author
Dhillon, Inderjit S. ; Guan, Yuqiang ; Kogan, J.
Author_Institution
Dept. of Comput. Sci., Texas Univ., Austin, TX, USA
fYear
2002
fDate
2002
Firstpage
131
Lastpage
138
Abstract
The k-means algorithm with cosine similarity, also known as the spherical k-means algorithm, is a popular method for clustering document collections. However, spherical k-means can often yield qualitatively poor results, especially when cluster sizes are small, say 25-30 documents per cluster, where it tends to get stuck at a local maximum far away from the optimal solution. In this paper, we present a local search procedure, which we call "first-variation", that refines a given clustering by incrementally moving data points between clusters, thus achieving a higher objective function value. An enhancement of first-variation allows a chain of such moves in a Kernighan-Lin fashion and leads to a better local maximum. Combining the enhanced first-variation with spherical k-means yields a powerful "ping-pong" strategy that often qualitatively improves k-means clustering and is computationally efficient. We present several experimental results to highlight the improvement achieved by our proposed algorithm in clustering high-dimensional and sparse text data.
Keywords
data mining; pattern clustering; search problems; text analysis; cosine similarity; document collection clustering; first variation; high dimensional text data; incremental data point movement; iterative clustering; local maximum; local search; objective function value; ping-pong strategy; sparse text data clustering; spherical k-means algorithm; Clustering algorithms; Data mining; Euclidean distance; Frequency; Information retrieval; Iterative algorithms; Mathematics; Refining; Statistics;
fLanguage
English
Publisher
IEEE
Conference_Titel
Data Mining, 2002. ICDM 2002. Proceedings. 2002 IEEE International Conference on
Print_ISBN
0-7695-1754-4
Type
conf
DOI
10.1109/ICDM.2002.1183895
Filename
1183895