Title :
Integrating phrases to enhance HSOMART-based document clustering
Author :
Hussin, Mahmoud F. ; Kamel, Mohamed S.
Author_Institution :
Dept. of Comput. Sci. & Autom. Control, Alexandria Univ., Egypt
Abstract :
Document clustering is a popular technique for helping users organize collections of documents. Two successful unsupervised neural network models, the self-organizing map (SOM) and adaptive resonance theory (ART), have shown promising results on this task. Most existing neural-network-based document clustering techniques rely on a "bag of words" document representation, in which each word in the document is treated as a separate feature and word order is ignored. We investigate the use of phrases rather than words as document features in our proposed document clustering technique, hierarchical SOMART (HSOMART), a hierarchical network built from independent SOM and ART neural networks. We describe a phrase grammar extraction technique and the proposed HSOMART. Experimental results on clustering documents from the REUTERS corpus show that using the extracted phrases as features improves clustering performance, as measured by entropy and F-measure.
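The abstract names entropy and F-measure as the evaluation measures but does not give their formulas. The following is a minimal sketch that assumes the definitions commonly used in the document clustering literature: size-weighted per-cluster entropy over the true class labels (lower is better), and the class-size-weighted best F-score of each class over all clusters (higher is better). The function names and the toy labels are illustrative assumptions, not taken from the paper.

```python
import math
from collections import Counter


def clustering_entropy(labels, clusters):
    """Size-weighted average of per-cluster entropy over the true class labels.

    Assumed definition: E = sum_j (n_j / n) * (-sum_i p_ij * ln p_ij),
    where p_ij is the fraction of cluster j that belongs to class i.
    """
    n = len(labels)
    total = 0.0
    for cluster_id in set(clusters):
        members = [lab for lab, c in zip(labels, clusters) if c == cluster_id]
        size = len(members)
        counts = Counter(members)
        cluster_entropy = -sum(
            (m / size) * math.log(m / size) for m in counts.values()
        )
        total += (size / n) * cluster_entropy
    return total


def clustering_f_measure(labels, clusters):
    """Class-size-weighted best F-score of each class over all clusters.

    Assumed definition: F = sum_i (n_i / n) * max_j F(i, j), with
    precision = n_ij / n_j and recall = n_ij / n_i.
    """
    n = len(labels)
    class_sizes = Counter(labels)
    cluster_sizes = Counter(clusters)
    joint = Counter(zip(labels, clusters))
    total = 0.0
    for cls, n_i in class_sizes.items():
        best = 0.0
        for clu, n_j in cluster_sizes.items():
            n_ij = joint.get((cls, clu), 0)
            if n_ij == 0:
                continue  # no overlap, F(i, j) is zero
            precision = n_ij / n_j
            recall = n_ij / n_i
            best = max(best, 2 * precision * recall / (precision + recall))
        total += (n_i / n) * best
    return total


# Toy example (hypothetical labels): four documents, two classes, two clusters.
true_labels = ["grain", "grain", "trade", "trade"]
pure_clusters = [0, 0, 1, 1]   # each cluster holds exactly one class
mixed_clusters = [0, 1, 0, 1]  # each cluster holds one document of each class
```

Under these assumed definitions, a clustering that matches the classes exactly yields entropy 0 and F-measure 1, while the mixed assignment above yields the maximum two-class entropy and a lower F-measure.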
Keywords :
ART neural nets; document handling; entropy; pattern clustering; self-organising feature maps; tree data structures; unsupervised learning; ART; F-measure; REUTERS corpus; adaptive resonance theory; bag of words; document clustering techniques; document representation; feature extraction; hierarchical network; phrase grammar extraction technique; phrase integration; self organizing map; unsupervised neural networks; Automatic control; Clustering algorithms; Computer science; Feature extraction; Neural networks; Organizing; Resonance; Subspace constraints; Text categorization; Tree graphs
Conference_Title :
Neural Networks, 2004. Proceedings. 2004 IEEE International Joint Conference on
Print_ISBN :
0-7803-8359-1
DOI :
10.1109/IJCNN.2004.1380993