Title :
Iterative clustering of high dimensional text data augmented by local search
Author :
Dhillon, Inderjit S. ; Guan, Yuqiang ; Kogan, J.
Author_Institution :
Dept. of Comput. Sci., Texas Univ., Austin, TX, USA
Abstract :
The k-means algorithm with cosine similarity, also known as the spherical k-means algorithm, is a popular method for clustering document collections. However, spherical k-means can often yield qualitatively poor results, especially when cluster sizes are small, say 25-30 documents per cluster, where it tends to get stuck at a local maximum far away from the optimal solution. In this paper, we present a local search procedure, which we call "first-variation", that refines a given clustering by incrementally moving data points between clusters, thus achieving a higher objective function value. An enhancement of first-variation allows a chain of such moves in a Kernighan-Lin fashion and leads to a better local maximum. Combining the enhanced first-variation with spherical k-means yields a powerful "ping-pong" strategy that often qualitatively improves k-means clustering and is computationally efficient. We present several experimental results to highlight the improvement achieved by our proposed algorithm in clustering high-dimensional and sparse text data.
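The alternation described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: all function names are hypothetical, the vectors are dense Python lists rather than sparse matrices, and the first-variation step moves only the single best point, leaving out the Kernighan-Lin chain of moves. It assumes every document vector is already unit length, and it uses the identity that the summed cosine similarity of a cluster's points to their normalized centroid equals the norm of the cluster's vector sum.

```python
import math

def norm(v):
    return math.sqrt(sum(a * a for a in v))

def normalize(v):
    n = norm(v)
    return [a / n for a in v] if n > 0 else list(v)

def cluster_sums(X, labels, k):
    # Per-cluster sum of member vectors (zero vector for an empty cluster).
    sums = [[0.0] * len(X[0]) for _ in range(k)]
    for x, j in zip(X, labels):
        for d, a in enumerate(x):
            sums[j][d] += a
    return sums

def objective(X, labels, k):
    # For unit vectors, total cosine similarity to the normalized centroids
    # equals the sum of the norms of the cluster sums.
    return sum(norm(s) for s in cluster_sums(X, labels, k))

def spherical_kmeans(X, labels, k, max_iter=100):
    # Batch step: reassign every point to its most similar centroid.
    labels = list(labels)
    for _ in range(max_iter):
        cents = [normalize(s) for s in cluster_sums(X, labels, k)]
        new = [max(range(k), key=lambda j: sum(a * b for a, b in zip(x, cents[j])))
               for x in X]
        if new == labels:
            break
        labels = new
    return labels

def first_variation_step(X, labels, k, tol=1e-9):
    # Local search: find the single point move that raises the objective most.
    sums = cluster_sums(X, labels, k)
    norms = [norm(s) for s in sums]
    best_delta, best_i, best_dst = tol, None, None
    for i, (x, src) in enumerate(zip(X, labels)):
        loss = norm([s - a for s, a in zip(sums[src], x)]) - norms[src]
        for dst in range(k):
            if dst == src:
                continue
            gain = norm([s + a for s, a in zip(sums[dst], x)]) - norms[dst]
            if loss + gain > best_delta:
                best_delta, best_i, best_dst = loss + gain, i, dst
    if best_i is None:
        return list(labels), 0.0  # no move improves the objective by more than tol
    moved = list(labels)
    moved[best_i] = best_dst
    return moved, best_delta

def ping_pong(X, labels, k, tol=1e-9, max_rounds=50):
    # Alternate batch k-means with single-move refinement until neither helps.
    labels = list(labels)
    for _ in range(max_rounds):
        labels = spherical_kmeans(X, labels, k)
        labels, delta = first_variation_step(X, labels, k, tol)
        if delta == 0.0:
            break
    return labels
```

A caller would normalize the document vectors once, for example `X = [normalize(v) for v in raw_vectors]`, and pass an initial labeling to `ping_pong`. Because both steps only accept changes that do not lower the objective, the refined clustering scores at least as well as spherical k-means run alone from the same start.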
Keywords :
data mining; pattern clustering; search problems; text analysis; cosine similarity; document collection clustering; first variation; high dimensional text data; incremental data point movement; iterative clustering; local maximum; local search; objective function value; ping-pong strategy; sparse text data clustering; spherical k-means algorithm; Clustering algorithms; Data mining; Euclidean distance; Frequency; Information retrieval; Iterative algorithms; Mathematics; Refining; Statistics;
Conference_Title :
Data Mining, 2002. ICDM 2002. Proceedings. 2002 IEEE International Conference on
Print_ISBN :
0-7695-1754-4
DOI :
10.1109/ICDM.2002.1183895