• DocumentCode
    3599997
  • Title
    Scaling Information-Theoretic Text Clustering: A Sampling-based Approximate Method
  • Author
    Zhexi Xu ; Zhiang Wu ; Jie Cao ; Hengnong Xuan
  • Author_Institution
    Sch. of Inf. Eng., Nanjing Univ. of Finance & Econ., Nanjing, China
  • fYear
    2014
  • Firstpage
    18
  • Lastpage
    25
  • Abstract
    Info-Kmeans, a K-means clustering method that uses KL-divergence as its proximity function, is one of the representative methods in information-theoretic clustering. With the explosive growth of online texts such as reviews and user-generated content, text data is becoming both sparser and much larger, which poses significant challenges to the effectiveness and the efficiency of text clustering. In our prior work, we presented a Summation-bAsed Incremental Learning (SAIL) algorithm, which avoids the zero-feature dilemma of highly sparse texts. In this paper, we propose a sampling-based approximate approach that scales the SAIL algorithm to large-scale text collections. Specifically, instance-level random sampling is used to reduce the number of instances examined during each iteration, which substantially speeds up clustering on big text data. Furthermore, we prove that the margin of error introduced by random sampling can be kept within a small range. Extensive experiments on eight real-life text datasets demonstrate the advantage of the proposed sampling-based approximate clustering method. In particular, our method shows merits in both the effectiveness and the efficiency of clustering.
  • Keywords
    Big Data; information theory; pattern clustering; sampling methods; text analysis; SAIL algorithm; information-theoretic text clustering; instance-level random sampling; sampling-based approximate clustering method; summation-based incremental learning; Algorithm design and analysis; Approximation algorithms; Clustering algorithms; Clustering methods; Indexes; Linear programming; Wireless application protocol; K-means; KL-divergence; Random; Text Clustering
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Advanced Cloud and Big Data (CBD), 2014 Second International Conference on
  • Print_ISBN
    978-1-4799-8086-4
  • Type
    conf
  • DOI
    10.1109/CBD.2014.56
  • Filename
    7176067