Regional Information Center for Science and Technology - Comparative Advantage Approach for Sparse Text Data Clustering

DocumentCode :

504055

Title :

Comparative Advantage Approach for Sparse Text Data Clustering

Author :

Ji, Jie ; Chan, Tony Y T ; Zhao, Qiangfu

Author_Institution :

Univ. of Aizu, Aizu-Wakamatsu, Japan

Volume :

fYear :

2009

fDate :

11-14 Oct. 2009

Firstpage :

Lastpage :

Abstract :

Document clustering is the process of partitioning a set of n unlabeled documents into clusters such that the documents in each cluster share some common concepts. Each concept is conveniently represented by some key terms. Using words as features, each document is represented as a vector in a very high-dimensional vector space. Most of these vectors are sparse, however: for example, they may have more than ten thousand dimensions with a sparsity of 98%. In this paper, we study a fast classification algorithm for clustering sparse data, based on the idea of comparative advantage. The proposed algorithm uses one "ruler" instead of k centers to identify the comparative advantage of each cluster and to define the cluster label for each document. Experimental results show that our algorithm performs comparably to k-means but runs faster. It can also produce clusters whose concepts overlap less, in the sense of key terms.

Keywords :

pattern classification; pattern clustering; text analysis; vectors; classification algorithm; document clustering; high dimensional vector space; sparse text data clustering; text data; words; Classification algorithms; Clustering algorithms; Frequency; Genetic algorithms; Information technology; Inverse problems; Unsupervised learning; Virtual manufacturing; Document clustering; comparative advantage; dimension reduction; k-means; key term extraction; sparsity

fLanguage :

English

Publisher :

IEEE

Conference_Title :

Computer and Information Technology, 2009. CIT '09. Ninth IEEE International Conference on

Conference_Location :

Xiamen

Print_ISBN :

978-0-7695-3836-5

Type :

conf

DOI :

10.1109/CIT.2009.22

Filename :

5329159

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=504055