Title :
SAIL: Summation-bAsed Incremental Learning for Information-Theoretic Text Clustering
Author :
Jie Cao ; Zhiang Wu ; Junjie Wu ; Hui Xiong
Author_Institution :
Jiangsu Provincial Key Lab. of E-Bus., Nanjing Univ. of Finance & Econ., Nanjing, China
Abstract :
Information-theoretic clustering uses information-theoretic measures as the clustering criteria. A common approach is the so-called Info-Kmeans, which performs K-means clustering with KL-divergence as the proximity function. While prior work on Info-Kmeans has shown promising results, a remaining challenge is dealing with high-dimensional sparse data such as text corpora. For high-dimensional text vectors, the centroids may contain many zero-valued features, which leads to infinite KL-divergence values and creates a dilemma in assigning objects to centroids during the iteration process of Info-Kmeans. To meet this challenge, we propose a Summation-bAsed Incremental Learning (SAIL) algorithm for Info-Kmeans clustering. Specifically, by using an equivalent objective function, SAIL replaces the computation of KL-divergence with the incremental computation of Shannon entropy, which avoids the zero-feature dilemma caused by the use of KL-divergence. To improve the clustering quality, we further introduce the variable neighborhood search scheme and propose the V-SAIL algorithm, which is then accelerated by a multithreaded scheme in PV-SAIL. Our experimental results on various real-world text collections show that, with SAIL as a booster, the clustering performance of Info-Kmeans can be significantly improved. V-SAIL and PV-SAIL also help improve the clustering quality at a lower computational cost.
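The zero-feature dilemma and the entropy-based reformulation described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the toy documents, the uniform weights, and every function name below are assumptions made for illustration. It relies only on the standard identity that, for a cluster whose centroid is the weighted mean of its members, the weighted sum of KL-divergences to the centroid equals the cluster mass times the centroid entropy, minus a constant that does not depend on the partition.

```python
import math

def entropy(p):
    """Shannon entropy (natural log) of a discrete distribution."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)

def kl_divergence(x, m):
    """D(x || m); infinite when m is zero on a feature where x is not."""
    total = 0.0
    for xi, mi in zip(x, m):
        if xi == 0.0:
            continue  # the term 0 * log(0 / mi) is taken as 0
        if mi == 0.0:
            return math.inf  # the zero-feature dilemma
        total += xi * math.log(xi / mi)
    return total

def weighted_sum(docs, weights, members):
    """Per-feature weighted sum over a cluster (hypothetical helper)."""
    dim = len(docs[0])
    return [sum(weights[j] * docs[j][i] for j in members) for i in range(dim)]

def centroid(s):
    """Normalize a weighted sum vector into a centroid distribution."""
    mass = sum(s)
    return [si / mass for si in s]

def kl_objective(docs, weights, clusters):
    """Sum over clusters of weighted KL-divergence from members to centroid."""
    total = 0.0
    for members in clusters:
        m = centroid(weighted_sum(docs, weights, members))
        total += sum(weights[j] * kl_divergence(docs[j], m) for j in members)
    return total

def entropy_objective(docs, weights, clusters):
    """Sum over clusters of (cluster mass) * (entropy of centroid).

    Only the per-cluster summation vectors are needed, and no logarithm
    of a ratio with a zero denominator ever appears.
    """
    total = 0.0
    for members in clusters:
        s = weighted_sum(docs, weights, members)
        total += sum(s) * entropy(centroid(s))
    return total

# Toy "documents": term distributions that each sum to 1 (assumed data).
docs = [
    [0.5, 0.5, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.5, 0.5],
]
weights = [1 / 3, 1 / 3, 1 / 3]

# Centroid of documents 0 and 1 is zero on the last two features, so the
# KL-divergence from document 2 to it is infinite.
m01 = centroid(weighted_sum(docs, weights, [0, 1]))
cross_kl = kl_divergence(docs[2], m01)

# Partition-independent constant linking the two objectives.
constant = sum(w * entropy(x) for w, x in zip(weights, docs))
```

For any partition of the toy data, `kl_objective(...)` and `entropy_objective(...) - constant` agree up to floating-point error, so minimizing one minimizes the other. The difference is that comparing candidate assignments with the entropy form never requires evaluating a divergence against a centroid that has zero-valued features, which is where `cross_kl` becomes infinite.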
Keywords :
entropy; iterative methods; learning (artificial intelligence); multi-threading; pattern clustering; search problems; text analysis; Info-Kmeans clustering; K-means clustering; KL-divergence; PV-SAIL; Shannon entropy; centroids; equivalent objective function; incremental computation; information theory; iteration process; multithreaded scheme; neighborhood search scheme; object assignment; proximity function; sparse data; summation-based incremental learning; text clustering; text collection; text vector; zero value feature; Clustering algorithms; Educational institutions; Entropy; Linear programming; Mutual information; Probabilistic logic; Vectors; Information-theoretic clustering; K-means distance; KL-divergence; multithreaded parallel computing; variable neighborhood search (VNS);
Journal_Title :
Cybernetics, IEEE Transactions on
DOI :
10.1109/TSMCB.2012.2212430