Title :
A Clustering Algorithm for Short Documents Based On Concept Similarity
Author :
Peng, Jing ; Yang, Dong-Qing ; Wang, Jian-Wei ; Wu, Meng-Qing ; Wang, Jun-Gang
Author_Institution :
Peking Univ., Beijing
Abstract :
In recent years, there has been increasing interest in clustering short documents. Existing work seldom considers the concept similarity between words, so clustering quality is often very low. This paper proposes a new document-clustering algorithm based on concept similarity for Chinese text processing. Unlike traditional methods, the algorithm first converts text into a word vector space model; second, it splits words into sets of concepts; third, it obtains the similarity between words by computing the inner products between their concepts; fourth, it computes the similarity between texts based on the similarity of words. Finally, in two phases, the algorithm completes the clustering of a specified set of documents. Extensive experiments demonstrate the validity and performance of the algorithm.
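The similarity steps in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy `CONCEPTS` lexicon, the concept weights, the cosine normalization of the inner product, and the symmetric best-match averaging in `text_similarity` are all assumptions made for illustration. The abstract does not specify the concept source, the aggregation rule, or the two-phase clustering procedure, so clustering is left out.

```python
from math import sqrt

# Hypothetical toy lexicon (an assumption): each word maps to weighted concepts.
CONCEPTS = {
    "car": {"vehicle": 1.0, "road": 0.5},
    "automobile": {"vehicle": 1.0, "road": 0.5},
    "bus": {"vehicle": 1.0, "public": 0.5},
    "apple": {"fruit": 1.0, "food": 0.5},
    "pear": {"fruit": 1.0, "food": 0.5},
}


def word_similarity(w1, w2):
    """Inner product of the two words' concept vectors, normalized to [0, 1]."""
    c1, c2 = CONCEPTS.get(w1, {}), CONCEPTS.get(w2, {})
    norm1 = sqrt(sum(v * v for v in c1.values()))
    norm2 = sqrt(sum(v * v for v in c2.values()))
    if norm1 == 0.0 or norm2 == 0.0:
        # Words missing from the lexicon only match themselves.
        return 1.0 if w1 == w2 else 0.0
    dot = sum(v * c2.get(concept, 0.0) for concept, v in c1.items())
    return dot / (norm1 * norm2)


def text_similarity(doc1, doc2):
    """Symmetric average of each word's best match in the other document.

    This aggregation rule is an assumption, not taken from the paper.
    """
    if not doc1 or not doc2:
        return 0.0

    def directed(a, b):
        return sum(max(word_similarity(x, y) for y in b) for x in a) / len(a)

    return (directed(doc1, doc2) + directed(doc2, doc1)) / 2.0


if __name__ == "__main__":
    print(word_similarity("car", "bus"))
    print(text_similarity(["car", "bus"], ["automobile"]))
    print(text_similarity(["car"], ["apple"]))
```

With a lexicon of this kind, two short documents that share no surface words can still score as similar when their words map to overlapping concepts, which is the effect the abstract attributes to concept similarity.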
Keywords :
natural languages; pattern clustering; text analysis; Chinese text processing; concept similarity; data clustering; short document clustering algorithm; Books; Clustering algorithms; Computer science; Data engineering; Data mining; Natural languages; Research and development; Text processing; Web pages;
Conference_Title :
Communications, Computers and Signal Processing, 2007. PacRim 2007. IEEE Pacific Rim Conference on
Conference_Location :
Victoria, BC
Print_ISBN :
978-1-4244-1189-4
Electronic_ISBN :
1-4244-1190-4
DOI :
10.1109/PACRIM.2007.4313172