DocumentCode
424030
Title
Integrating phrases to enhance HSOMART-based document clustering
Author
Hussin, Mahmoud F. ; Kamel, Mohamed S.
Author_Institution
Dept. of Comput. Sci. & Autom. Control, Alexandria Univ., Egypt
Volume
3
fYear
2004
fDate
25-29 July 2004
Firstpage
2347
Abstract
Document clustering is a popular technique for helping users organize collections of documents. Two successful models of unsupervised neural networks, the self-organizing map (SOM) and adaptive resonance theory (ART), have shown promising results in this task. Most existing neural-network-based document clustering techniques rely on a "bag of words" document representation, in which each word in the document is treated as a separate feature and word order is ignored. We investigate the use of phrases rather than words as document features, applied to our proposed document clustering technique, called hierarchical SOMART (HSOMART), which is a hierarchical network built up from independent SOM and ART neural networks. We describe a phrase grammar extraction technique and the proposed HSOMART. Experimental results of clustering documents from the REUTERS corpus using the extracted phrases as features show an improvement in clustering performance, as measured by entropy and F-measure.
Keywords
ART neural nets; document handling; entropy; pattern clustering; self-organising feature maps; tree data structures; unsupervised learning; ART; F-measure; REUTERS corpus; adaptive resonance theory; bag of words; document clustering techniques; document representation; feature extraction; hierarchical network; phrase grammar extraction technique; phrase integration; self organizing map; unsupervised neural networks; Automatic control; Clustering algorithms; Computer science; Feature extraction; Neural networks; Organizing; Resonance; Subspace constraints; Text categorization; Tree graphs
fLanguage
English
Publisher
IEEE
Conference_Titel
Neural Networks, 2004. Proceedings. 2004 IEEE International Joint Conference on
ISSN
1098-7576
Print_ISBN
0-7803-8359-1
Type
conf
DOI
10.1109/IJCNN.2004.1380993
Filename
1380993