DocumentCode
588793
Title
Which Feature is Better? TF*IDF Feature or Topic Feature in Text Clustering
Author
Xiahui Pan ; Jiajun Cheng ; Youqing Xia ; Xin Zhang ; Hui Wang
Author_Institution
Coll. of Inf. Syst. & Manage., Nat. Univ. of Defence Technol., Changsha, China
fYear
2012
fDate
2-4 Nov. 2012
Firstpage
425
Lastpage
428
Abstract
In this paper, we conduct a comparative study of two text features for text corpus clustering: the TF*IDF feature and the topic feature. The former is mainly used in similarity-based text corpus clustering methods, while the latter, which is produced by the LDA model, is used to identify the topics of texts. We conduct clustering experiments on the 20-newsgroups (20NG) dataset, applying two typical text clustering methods to compare the clustering performance of the two features. The experiments demonstrate that, if the optimal topic number is chosen, the topic feature outperforms the TF*IDF feature in clustering accuracy.
Keywords
feature extraction; pattern clustering; text analysis; LDA model; TF*IDF feature; dataset; similarity-based text corpus clustering methods; text features; topic feature; Multimedia communication; Security; K-means; LDA; Single-pass; TF*IDF; Text Clustering; topic
fLanguage
English
Publisher
IEEE
Conference_Titel
Multimedia Information Networking and Security (MINES), 2012 Fourth International Conference on
Conference_Location
Nanjing
Print_ISBN
978-1-4673-3093-0
Type
conf
DOI
10.1109/MINES.2012.249
Filename
6405714