• DocumentCode
    1309483
  • Title
    Text Clustering with Seeds Affinity Propagation
  • Author
    Guan, Renchu ; Shi, Xiaohu ; Marchese, Maurizio ; Yang, Chen ; Liang, Yanchun
  • Author_Institution
    Coll. of Comput. Sci. & Technol., Jilin Univ., Changchun, China
  • Volume
    23
  • Issue
    4
  • fYear
    2011
  • fDate
    4/1/2011 12:00:00 AM
  • Firstpage
    627
  • Lastpage
    637
  • Abstract
    Based on an effective clustering algorithm, Affinity Propagation (AP), we present in this paper a novel semisupervised text clustering algorithm, called Seeds Affinity Propagation (SAP). There are two main contributions in our approach: 1) a new similarity metric that captures the structural information of texts, and 2) a novel seed construction method to improve the semisupervised clustering process. To study the performance of the new algorithm, we applied it to the benchmark data set Reuters-21578 and compared it to two state-of-the-art clustering algorithms, namely, the k-means algorithm and the original AP algorithm. Furthermore, we have analyzed the individual impact of the two proposed contributions. Results show that the proposed similarity metric is more effective in text clustering (F-measures ca. 21 percent higher than in the AP algorithm) and that the proposed semisupervised strategy achieves both better clustering results and faster convergence (using only 76 percent of the iterations of the original AP). The complete SAP algorithm obtains a higher F-measure (ca. 40 percent improvement over k-means and AP) and lower entropy (ca. 28 percent decrease over k-means and AP), significantly improves clustering execution time (20 times faster) with respect to k-means, and provides enhanced robustness compared with all other methods.
  • Keywords
    pattern clustering; text analysis; F-measure; Reuters-21578 data set; seed construction method; seeds affinity propagation; semisupervised text clustering algorithm; similarity metric; Affinity propagation; cofeature set; significant cofeature set; text clustering; unilateral feature set
  • fLanguage
    English
  • Journal_Title
    Knowledge and Data Engineering, IEEE Transactions on
  • Publisher
    IEEE
  • ISSN
    1041-4347
  • Type
    jour
  • DOI
    10.1109/TKDE.2010.144
  • Filename
    5560648
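
The abstract reports results as F-measure and entropy, so a small sketch of how such scores are commonly computed for a clustering may help in reading those figures. This is a minimal illustration under stated assumptions, not the authors' evaluation code: it assumes the widely used class-weighted best-match F-measure and the size-weighted per-cluster entropy (base-2 logarithm, not normalized), which may differ from the exact definitions in the paper. The function names `clustering_f_measure` and `clustering_entropy` are hypothetical.

```python
import math
from collections import Counter


def clustering_f_measure(labels_true, labels_pred):
    """Class-weighted best-match F-measure (assumed definition).

    For each true class, take the best F1 over all clusters, then
    average these values weighted by class size.
    """
    n = len(labels_true)
    class_sizes = Counter(labels_true)
    cluster_sizes = Counter(labels_pred)
    joint = Counter(zip(labels_true, labels_pred))

    total = 0.0
    for cls, n_cls in class_sizes.items():
        best = 0.0
        for clu, n_clu in cluster_sizes.items():
            n_both = joint.get((cls, clu), 0)
            if n_both == 0:
                continue  # no overlap, F1 is zero for this pair
            precision = n_both / n_clu
            recall = n_both / n_cls
            f1 = 2 * precision * recall / (precision + recall)
            best = max(best, f1)
        total += (n_cls / n) * best
    return total


def clustering_entropy(labels_true, labels_pred):
    """Size-weighted sum of per-cluster class entropies (assumed definition).

    Lower is better; 0.0 means every cluster contains a single class.
    """
    n = len(labels_true)
    cluster_sizes = Counter(labels_pred)
    joint = Counter(zip(labels_true, labels_pred))

    total = 0.0
    for (cls, clu), n_both in joint.items():
        p = n_both / cluster_sizes[clu]  # share of class cls inside cluster clu
        total -= (cluster_sizes[clu] / n) * p * math.log2(p)
    return total


if __name__ == "__main__":
    truth = ["a", "a", "b", "b"]
    print(clustering_f_measure(truth, [0, 0, 1, 1]))
    print(clustering_entropy(truth, [0, 0, 0, 0]))
```

Under these definitions a higher F-measure and a lower entropy both indicate clusters that agree more closely with the reference classes, which is the direction of improvement the abstract describes.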