Document Clustering with Semantic Analysis

Author

Wang, Yong ; Hodges, Julia

Author_Institution

Mississippi State University

Volume

3

fYear

2006

fDate

04-07 Jan. 2006

Abstract

Document clustering automatically generates clusters from a whole document collection and is used in many fields, including data mining and information retrieval. In the traditional vector space model, the unique words occurring in the document set are used as features. Because of synonymy and polysemy, however, such a bag of original words cannot represent the content of a document precisely. In this paper, we investigate using a word sense disambiguation method to identify the senses of words and to construct the feature vector for document representation from those senses. Our experimental results demonstrate that in most conditions, using senses can improve the performance of our document clustering system. However, a comprehensive statistical analysis indicates that the differences between using original single words and using word senses are not statistically significant. We also provide an evaluation of several basic clustering algorithms for algorithm selection.
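
The contrast the abstract draws between word features and sense features can be illustrated with a small sketch. It is not the authors' system: the `SENSE_MAP` lexicon, the sense labels, and the cue-word rule in `disambiguate` are hypothetical stand-ins for a real word sense disambiguation method and thesaurus, and cosine similarity is assumed here only as a common way to compare vectors in the vector space model.

```python
from collections import Counter
from math import sqrt

# Hypothetical toy sense inventory (an assumption for illustration only).
# A plain string maps a word to one sense; a dict maps context cue words
# to the sense chosen when that cue appears in the same document.
SENSE_MAP = {
    "car": "vehicle.n.01",
    "automobile": "vehicle.n.01",
    "bank": {"river": "bank.n.shore", "money": "bank.n.finance"},
}


def disambiguate(word, tokens):
    """Return a sense label for `word`, or the word itself if unknown."""
    entry = SENSE_MAP.get(word, word)
    if isinstance(entry, dict):
        for cue, sense in entry.items():
            if cue in tokens:
                return sense
        return word  # no cue found: fall back to the surface word
    return entry


def word_vector(text):
    """Bag of original words: term frequencies of the raw tokens."""
    return Counter(text.lower().split())


def sense_vector(text):
    """Bag of senses: term frequencies after mapping tokens to senses."""
    tokens = text.lower().split()
    return Counter(disambiguate(t, tokens) for t in tokens)


def cosine(a, b):
    """Cosine similarity between two sparse frequency vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Synonyms: different words, same sense.
syn_words = cosine(word_vector("car engine"), word_vector("automobile engine"))
syn_senses = cosine(sense_vector("car engine"), sense_vector("automobile engine"))

# Polysemy: same word, different senses.
poly_words = cosine(word_vector("river bank"), word_vector("money bank"))
poly_senses = cosine(sense_vector("river bank"), sense_vector("money bank"))
```

In this toy setting, sense features raise the similarity of the synonym pair and lower the similarity of the polysemous pair relative to word features. Whether that carries over to real collections is exactly what the paper tests, and its statistical analysis finds the difference not significant.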

Keywords

Clustering algorithms; Computer science; Data engineering; Data mining; Databases; Frequency; Information retrieval; Partitioning algorithms; Statistical analysis; Thesauri

fLanguage

English

Publisher

IEEE

Conference_Title

System Sciences, 2006. HICSS '06. Proceedings of the 39th Annual Hawaii International Conference on

ISSN

1530-1605

Print_ISBN

0-7695-2507-5

Type

conf

DOI

10.1109/HICSS.2006.129

Filename

1579400