DocumentCode :
3120498
Title :
A study on criteria for extracting key terms in document clustering
Author :
Ji, Jie ; Zhao, Qiangfu ; Shindo, Ryouhei ; Kunishi, Yousuke
Author_Institution :
Univ. of Aizu, Aizu-Wakamatsu
fYear :
2008
fDate :
12-15 Oct. 2008
Firstpage :
3674
Lastpage :
3679
Abstract :
Document clustering is the process of partitioning a set of unlabelled documents into clusters. To analyze the documents efficiently and effectively, all documents in each cluster are expected to share some concept. This shared concept is most conveniently represented by a set of key terms. Many methods have been studied for selecting important key terms. However, most of them are supervised learning methods; that is, teacher signals must be provided in advance in order to measure the importance of the key terms. In this paper, we study unsupervised learning only. Specifically, we study three criteria for extracting important key terms through clustering. The first is the mean squared error (MSE) function. It is well known that clusters obtained with the MSE criterion are good in the sense that all documents in each cluster are similar. In addition to MSE, we introduce two new criteria, both of which encourage each cluster to use a different set of key terms. Experimental results on three databases show that MSE, although simple, is surprisingly good at generating representative key terms. One advantage of the proposed criteria is that they can generate more balanced clusters.
Keywords :
document handling; least mean squares methods; pattern clustering; unsupervised learning; document clustering; key term extraction; mean squared error function; unlabelled documents; Databases; Flowcharts; Frequency; Polymers; Supervised learning; Virtual manufacturing; criterion function; k-means
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Systems, Man and Cybernetics, 2008. SMC 2008. IEEE International Conference on
Conference_Location :
Singapore
ISSN :
1062-922X
Print_ISBN :
978-1-4244-2383-5
Electronic_ISBN :
1062-922X
Type :
conf
DOI :
10.1109/ICSMC.2008.4811870
Filename :
4811870