DocumentCode :
588793
Title :
Which Feature is Better? TF*IDF Feature or Topic Feature in Text Clustering
Author :
Xiahui Pan ; Jiajun Cheng ; Youqing Xia ; Xin Zhang ; Hui Wang
Author_Institution :
Coll. of Inf. Syst. & Manage., Nat. Univ. of Defence Technol., Changsha, China
fYear :
2012
fDate :
2-4 Nov. 2012
Firstpage :
425
Lastpage :
428
Abstract :
In this paper, we conduct a comparative study of two text features used in text corpus clustering: the TF*IDF feature and the topic feature. The former is mainly used in similarity-based text corpus clustering methods, while the latter, which is produced by the LDA model, is used to identify the topics of texts. We conduct clustering experiments on the 20-newsgroups (20NG) dataset. On this dataset, two typical text clustering methods are employed to compare the clustering performance of the two text features. The experiments demonstrate that, if the optimal topic number is chosen, the topic feature outperforms the TF*IDF feature in clustering accuracy.
Keywords :
feature extraction; pattern clustering; text analysis; LDA model; TF*IDF feature; dataset; similarity-based text corpus clustering methods; text features; topic feature; Multimedia communication; Security; K-means; LDA; Single-pass; TF*IDF; Text Clustering; topic;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Multimedia Information Networking and Security (MINES), 2012 Fourth International Conference on
Conference_Location :
Nanjing
Print_ISBN :
978-1-4673-3093-0
Type :
conf
DOI :
10.1109/MINES.2012.249
Filename :
6405714