Scaling Information-Theoretic Text Clustering: A Sampling-based Approximate Method

Author

Zhexi Xu ; Zhiang Wu ; Jie Cao ; Hengnong Xuan

Author_Institution

Sch. of Inf. Eng., Nanjing Univ. of Finance & Econ., Nanjing, China

fYear

2014

Firstpage

18

Lastpage

25

Abstract

Info-Kmeans, a K-means clustering method that uses KL-divergence as its proximity function, is one of the representative methods in information-theoretic clustering. With the explosive growth of online texts such as online reviews and user-generated content, text data are becoming both sparser and much larger, which poses significant challenges to the effectiveness and efficiency of text clustering. In our prior work, we presented a Summation-bAsed Incremental Learning (SAIL) algorithm, which avoids the zero-feature dilemma of highly sparse texts. In this paper, we propose a sampling-based approximate approach for scaling the SAIL algorithm to large-scale text collections. In particular, instance-level random sampling is used to reduce the number of instances examined during each iteration, which substantially speeds up clustering on big text data. Furthermore, we prove that the margin of error introduced by random sampling can be kept within a small range. Extensive experiments on eight real-life text datasets demonstrate the advantages of the proposed sampling-based approximate clustering method in both clustering effectiveness and efficiency.
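
The general idea described in the abstract, K-means with KL-divergence as the proximity function plus instance-level random sampling in each iteration, can be illustrated with a minimal sketch. This is not the authors' SAIL algorithm or their sampling scheme: the function names, the `sample_ratio` parameter, the uniform instance weights, and the `eps` smoothing (used here only to keep the divergence finite when a centroid has zero-valued features) are all assumptions made for illustration.

```python
import math
import random


def kl_divergence(p, q, eps=1e-9):
    """KL(p || q) for two discrete distributions of equal length.

    eps is added to q so that zero entries in q do not yield infinities.
    This smoothing is an assumption of the sketch, not the SAIL technique.
    """
    return sum(pi * math.log(pi / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def normalize(vector):
    """Scale a non-negative term-count vector so that it sums to 1."""
    total = sum(vector)
    return [x / total for x in vector]


def mean_vector(rows):
    """Arithmetic mean of the rows; it minimizes the summed KL(x || m)."""
    n = len(rows)
    return [sum(column) / n for column in zip(*rows)]


def nearest(doc, centroids):
    """Index of the centroid with the smallest KL(doc || centroid)."""
    return min(range(len(centroids)), key=lambda c: kl_divergence(doc, centroids[c]))


def sampled_info_kmeans(docs, k, sample_ratio=0.5, iters=20, seed=0):
    """K-means with KL-divergence, examining only a random sample per iteration."""
    rng = random.Random(seed)
    docs = [normalize(d) for d in docs]
    centroids = [list(d) for d in rng.sample(docs, k)]
    sample_size = max(k, int(sample_ratio * len(docs)))

    for _ in range(iters):
        # Instance-level sampling: only the sampled documents are reassigned.
        sample = rng.sample(docs, sample_size)
        groups = [[] for _ in range(k)]
        for doc in sample:
            groups[nearest(doc, centroids)].append(doc)
        # A centroid whose group is empty in this sample keeps its old value.
        centroids = [
            mean_vector(group) if group else centroids[j]
            for j, group in enumerate(groups)
        ]

    # Final pass assigns every document, sampled or not, to its nearest centroid.
    labels = [nearest(doc, centroids) for doc in docs]
    return labels, centroids
```

For example, `sampled_info_kmeans(term_counts, k=2, sample_ratio=0.7)` would cluster a hypothetical list of term-count vectors `term_counts` while examining roughly 70% of the documents in each iteration. Lower ratios reduce per-iteration cost at the price of noisier centroid updates.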

Keywords

Big Data; information theory; pattern clustering; sampling methods; text analysis; SAIL algorithm; information-theoretic text clustering; instance-level random sampling; sampling-based approximate clustering method; summation-based incremental learning; Algorithm design and analysis; Approximation algorithms; Clustering algorithms; Clustering methods; Indexes; Linear programming; Wireless application protocol; K-means; KL-divergence; Random; Text Clustering;

fLanguage

English

Publisher

IEEE

Conference_Titel

Advanced Cloud and Big Data (CBD), 2014 Second International Conference on

Print_ISBN

978-1-4799-8086-4

Type

conf

DOI

10.1109/CBD.2014.56

Filename

7176067