Title :
A feature selection algorithm for document clustering based on word co-occurrence frequency
Author :
Liu, Yuan-Chao ; Wang, Xiao-long ; Liu, Bing-quan
Author_Institution :
Sch. of Comput. Sci. & Technol., Harbin Inst. of Technol., China
Abstract :
Constructing the feature space from only the more informative words can greatly speed up document clustering algorithms while leaving cluster quality largely unaffected. In this paper, we first discuss the impact of feature selection on document clustering and then propose a new feature selection method based on word co-occurrence frequency. According to the cluster hypothesis, documents from the same class are more similar to each other when represented in the vector space model (VSM), so many of the words in these documents tend to appear together. We find these words through word co-occurrence and then construct a reduced feature space for clustering. Experiments show that the selected features are more salient. When documents are clustered in the new reduced feature space, run time is shortened greatly while cluster quality is almost unchanged, which makes the clustering algorithm more suitable for practical use.
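The abstract does not spell out the selection rule, so the following is only a minimal sketch of the general idea of keeping words that frequently appear together in the same documents. The function name `select_features_by_cooccurrence`, the `min_cooccur` threshold, the whitespace tokenization, and the toy documents are all assumptions made for illustration and are not taken from the paper.

```python
from collections import Counter
from itertools import combinations


def select_features_by_cooccurrence(docs, min_cooccur=2):
    """Return words that co-occur with some other word in at least
    `min_cooccur` documents (hypothetical rule, not the paper's exact one)."""
    pair_counts = Counter()
    for doc in docs:
        # Count each unordered word pair at most once per document.
        words = sorted(set(doc.lower().split()))
        pair_counts.update(combinations(words, 2))

    selected = set()
    for (w1, w2), count in pair_counts.items():
        if count >= min_cooccur:
            selected.update((w1, w2))
    return sorted(selected)


# Toy corpus (invented for this example): two loose topics.
docs = [
    "cluster document vector space",
    "document cluster feature selection",
    "football match score",
    "football score league",
]

features = select_features_by_cooccurrence(docs, min_cooccur=2)
print(features)
```

In this sketch, words that never appear alongside the same partner in two or more documents are dropped, so the reduced feature space contains only the words returned by the function. A real system would likely need proper tokenization, stop-word removal, and a threshold tuned to the corpus.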
Keywords :
feature extraction; pattern clustering; text analysis; vectors; cluster hypothesis; document clustering algorithm; feature selection algorithm; feature space construction; vector space model; word co-occurrence frequency; Clustering algorithms; Computer science; Explosives; Frequency; Internet; Navigation; Partitioning algorithms; Search engines; Space technology; Unsupervised learning
Conference_Title :
Machine Learning and Cybernetics, 2004. Proceedings of 2004 International Conference on
Print_ISBN :
0-7803-8403-2
DOI :
10.1109/ICMLC.2004.1378540