• DocumentCode
    588793
  • Title
    Which Feature is Better? TF*IDF Feature or Topic Feature in Text Clustering
  • Author
    Xiahui Pan; Jiajun Cheng; Youqing Xia; Xin Zhang; Hui Wang
  • Author_Institution
    Coll. of Inf. Syst. & Manage., Nat. Univ. of Defence Technol., Changsha, China
  • fYear
    2012
  • fDate
    2-4 Nov. 2012
  • Firstpage
    425
  • Lastpage
    428
  • Abstract
    In this paper, we conduct a comparative study of two text features for text corpus clustering: the TF*IDF feature and the topic feature. The former is mainly used in similarity-based text corpus clustering methods, while the latter, which is produced by the LDA model, is used to identify the topics of texts. We conduct clustering experiments on the 20-newsgroups (20NG) dataset, employing two typical text clustering methods to compare the clustering performance of the two features. The experiments demonstrate that, if the optimal topic number is chosen, the topic feature outperforms the TF*IDF feature in clustering accuracy.
  • Keywords
    feature extraction; pattern clustering; text analysis; LDA model; TF*IDF feature; dataset; similarity-based text corpus clustering methods; text features; topic feature; Multimedia communication; Security; K-means; LDA; Single-pass; TF*IDF; Text Clustering; topic
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Multimedia Information Networking and Security (MINES), 2012 Fourth International Conference on
  • Conference_Location
    Nanjing
  • Print_ISBN
    978-1-4673-3093-0
  • Type
    conf
  • DOI
    10.1109/MINES.2012.249
  • Filename
    6405714