Title :
Comparative Advantage Approach for Sparse Text Data Clustering
Author :
Ji, Jie ; Chan, Tony Y T ; Zhao, Qiangfu
Author_Institution :
Univ. of Aizu, Aizu-Wakamatsu, Japan
Abstract :
Document clustering is the process of partitioning a set of n unlabeled documents into clusters such that the documents in each cluster share some common concepts. Each concept is conveniently represented by some key terms. Using words as features, each document is represented as a vector in a very high-dimensional vector space. However, most document vectors are sparse: for example, they may have more than ten thousand dimensions and a sparsity of 98%. In this paper, we study a fast classification algorithm based on the idea of comparative advantage for clustering sparse data. The proposed algorithm uses one "ruler" instead of k centers to identify the comparative advantage of each cluster and to define the cluster label of each document. Experimental results show that our algorithm achieves performance comparable to that of k-means while running faster. It can also produce clusters whose concepts overlap less in the sense of key terms.
Keywords :
pattern classification; pattern clustering; text analysis; vectors; classification algorithm; document clustering; high dimensional vector space; sparse text data clustering; text data; words; Classification algorithms; Clustering algorithms; Frequency; Genetic algorithms; Information technology; Inverse problems; Unsupervised learning; Virtual manufacturing; Document clustering; comparative advantage; dimension reduction; k-means; key term extraction; sparsity
Conference_Title :
Computer and Information Technology, 2009. CIT '09. Ninth IEEE International Conference on
Conference_Location :
Xiamen
Print_ISBN :
978-0-7695-3836-5
DOI :
10.1109/CIT.2009.22