A Clustering Algorithm for Short Documents Based On Concept Similarity

Author

Peng, Jing ; Yang, Dong-Qing ; Wang, Jian-Wei ; Wu, Meng-Qing ; Wang, Jun-Gang

Author_Institution

Peking Univ., Beijing

fYear

2007

fDate

22-24 Aug. 2007

Firstpage

42

Lastpage

45

Abstract

In recent years, there has been increasing interest in clustering short documents. Existing work seldom considers the concept similarity between words, so clustering quality is often very low. This paper proposes a new document-clustering algorithm based on concept similarity for Chinese text processing. Unlike traditional methods, the algorithm first converts text into a word vector space model; second, it splits words into sets of concepts; third, it obtains the similarity between words by computing the inner products between their concepts; fourth, it computes the similarity of texts from the similarity of their words. Finally, in a two-phase procedure, the algorithm clusters a specified set of documents. Extensive experiments demonstrate the validity and performance of the algorithm.
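
The steps named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the concept lexicon `CONCEPTS`, its weights, the normalization of the inner product, the best-match averaging in `text_similarity`, the `threshold` value, and the single-pass grouping in `cluster` (which stands in for the two-phase procedure the abstract does not detail) are all assumptions made for illustration.

```python
from math import sqrt

# Hypothetical concept lexicon: each word maps to weighted concepts.
# A real system would draw these from a concept dictionary.
CONCEPTS = {
    "car": {"vehicle": 1.0, "road": 0.5},
    "truck": {"vehicle": 1.0, "cargo": 0.5},
    "apple": {"fruit": 1.0, "food": 0.5},
    "pear": {"fruit": 1.0, "food": 0.5},
}


def word_similarity(w1, w2):
    """Inner product of two words' concept vectors, normalized to [0, 1].

    The normalization is an assumption; the abstract only mentions
    inner products between concepts.
    """
    c1, c2 = CONCEPTS.get(w1, {}), CONCEPTS.get(w2, {})
    n1 = sqrt(sum(v * v for v in c1.values()))
    n2 = sqrt(sum(v * v for v in c2.values()))
    if n1 == 0 or n2 == 0:
        # Words without known concepts only match themselves.
        return 1.0 if w1 == w2 else 0.0
    dot = sum(v * c2.get(k, 0.0) for k, v in c1.items())
    return dot / (n1 * n2)


def text_similarity(d1, d2):
    """Symmetric average of each word's best match in the other text."""
    if not d1 or not d2:
        return 0.0
    a = sum(max(word_similarity(w, v) for v in d2) for w in d1) / len(d1)
    b = sum(max(word_similarity(w, v) for v in d1) for w in d2) / len(d2)
    return (a + b) / 2


def cluster(docs, threshold=0.5):
    """Greedy grouping: join the first cluster whose seed is similar enough.

    Returns lists of document indices. This single pass is a stand-in,
    not the paper's two-phase clustering.
    """
    clusters = []
    for i, doc in enumerate(docs):
        for members in clusters:
            if text_similarity(doc, docs[members[0]]) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters


docs = [["car", "truck"], ["truck", "car"], ["apple", "pear"], ["pear"]]
print(cluster(docs))
```

In this toy lexicon "car" and "truck" share only the "vehicle" concept, so they are similar but not identical, while the vehicle and fruit texts share no concepts and fall into separate groups.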

Keywords

natural languages; pattern clustering; text analysis; Chinese text processing; concept similarity; data clustering; short document clustering algorithm; Books; Clustering algorithms; Computer science; Data engineering; Data mining; Natural languages; Research and development; Text processing; Web pages

fLanguage

English

Publisher

IEEE

Conference_Titel

Communications, Computers and Signal Processing, 2007. PacRim 2007. IEEE Pacific Rim Conference on

Conference_Location

Victoria, BC

Print_ISBN

978-1-4244-1189-4

Electronic_ISBN

1-4244-1190-4

Type

conf

DOI

10.1109/PACRIM.2007.4313172

Filename

4313172