Title :
Combining word based and word co-occurrence based sequence analysis for text categorization
Author :
Luo, Xiao ; Zincir-Heywood, A. Nur
Author_Institution :
Fac. of Comput. Sci., Dalhousie Univ., Halifax, NS, Canada
Abstract :
This paper presents a text categorization system that combines a hierarchical self-organizing map (SOM) encoding architecture with a kNN classifier designed for it. The encoding architecture encodes a document as sequences of neurons, so that the sequences of words and word co-occurrences, as well as their frequencies, are preserved. The system achieves a micro-averaged F1-measure of 0.98 on the experimental data set. This sequence analysis approach to text categorization could automatically address the high-dimensionality problem for large data sets, and it could be applied to other data categorization tasks where sequence information is important.
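As an illustration of the evaluation metric named in the abstract, the following is a minimal sketch of how a micro-averaged F1-measure is commonly computed: true positives, false positives, and false negatives are pooled over all documents and categories before precision and recall are taken. The function name `micro_f1`, the set-based label representation, and the category names are assumptions made for this example; they are not taken from the paper, and this is not the authors' evaluation code.

```python
def micro_f1(true_sets, pred_sets):
    """Micro-averaged F1 over documents with one or more category labels.

    true_sets and pred_sets are parallel lists of sets of category labels
    (a hypothetical representation chosen for this sketch).
    """
    tp = fp = fn = 0
    for true_labels, pred_labels in zip(true_sets, pred_sets):
        tp += len(true_labels & pred_labels)   # correctly assigned labels
        fp += len(pred_labels - true_labels)   # assigned but wrong
        fn += len(true_labels - pred_labels)   # missed labels
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical category labels, for illustration only.
true_sets = [{"earn"}, {"acq", "earn"}, {"grain"}]
pred_sets = [{"earn"}, {"acq"}, {"grain", "wheat"}]
score = micro_f1(true_sets, pred_sets)
```

Because the counts are pooled before averaging, categories with many documents weigh more heavily in the result than rare ones.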
Keywords :
encoding; neural net architecture; pattern classification; self-organising feature maps; text analysis; document encoding; encoding architecture; kNN classifier; self organization map; text categorization; word cooccurrence based sequence analysis; Computer architecture; Computer science; Content management; Electronic mail; Encoding; Frequency; Information analysis; Machine learning; Neurons; Text categorization;
Conference_Title :
Machine Learning and Cybernetics, 2004. Proceedings of 2004 International Conference on
Print_ISBN :
0-7803-8403-2
DOI :
10.1109/ICMLC.2004.1382026