Title :
Pruning the vocabulary for better context recognition
Author :
Madsen, Rasmus Elsborg ; Sigurdsson, Sigurdur ; Hansen, Lars Kai ; Larsen, Jan
Author_Institution :
Inf. & Math. Modelling, Tech. Univ. Denmark, Lyngby, Denmark
Abstract :
Language-independent 'bag-of-words' representations are surprisingly effective for text classification. The representation is high-dimensional, however, and contains many words that are non-consistent for text categorization. These non-consistent words reduce the generalization performance of subsequent classifiers, e.g., through ill-posed principal component transformations. In this communication, we study the effect of removing the least relevant words from the bag-of-words representation. We consider a new approach that uses neural-network-based sensitivity maps and information gain to determine term relevancy when pruning the vocabularies. With reduced vocabularies, documents are classified using a latent semantic indexing representation and a probabilistic neural network classifier. Reducing the bag-of-words vocabularies by 90%-98%, we find consistent classification improvement on two mid-size data sets. We also study the applicability of information gain and sensitivity maps for automated keyword generation.
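The information-gain pruning step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formulation, so this example assumes the common text-categorization definition based on binary term presence, IG(t) = H(C) - P(t) H(C|t) - P(not t) H(C|not t). The function names (entropy, information_gain, prune_vocabulary) and the keep_fraction parameter are hypothetical, chosen only for illustration. The sensitivity-map criterion, the latent semantic indexing step, and the classifier are not shown.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy in bits of a distribution given as raw counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(docs, labels):
    """Return {term: information gain}, using binary term presence.

    docs is a list of token lists and labels the matching class labels.
    Assumption: presence/absence per document, not term frequency.
    """
    n = len(docs)
    class_counts = Counter(labels)
    h_class = entropy(class_counts.values())

    # For each term, count the documents of each class that contain it.
    present = {}
    for tokens, label in zip(docs, labels):
        for term in set(tokens):
            present.setdefault(term, Counter())[label] += 1

    gains = {}
    for term, with_term in present.items():
        n_with = sum(with_term.values())
        # Per-class counts of documents that do NOT contain the term.
        without = [class_counts[c] - with_term[c] for c in class_counts]
        h_cond = (n_with / n) * entropy(with_term.values())
        h_cond += ((n - n_with) / n) * entropy(without)
        gains[term] = h_class - h_cond
    return gains


def prune_vocabulary(docs, labels, keep_fraction=0.1):
    """Keep the top keep_fraction of terms ranked by information gain."""
    gains = information_gain(docs, labels)
    n_keep = max(1, round(keep_fraction * len(gains)))
    # Ties are broken alphabetically so the result is deterministic.
    ranked = sorted(gains, key=lambda t: (-gains[t], t))
    return set(ranked[:n_keep])
```

Under these assumptions, a keep_fraction between 0.02 and 0.10 would correspond to the 90%-98% vocabulary reduction reported in the abstract. The retained terms would then define the reduced bag-of-words matrix passed on to the later stages.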
Keywords :
indexing; neural nets; pattern classification; principal component analysis; probability; semantic Web; text analysis; vocabulary; bag-of-words representation; bag-of-words vocabularies; context recognition; generalization performance; information gain; nonconsistent words; principal component transformations; probabilistic neural network classifier; semantic indexing representation; sensitivity maps; subsequent classifiers; text categorization; text classification; Databases; Humans; Indexing; Internet; Large scale integration; Learning systems; Machine learning; Neural networks; Text categorization; Vocabulary;
Conference_Title :
Neural Networks, 2004. Proceedings. 2004 IEEE International Joint Conference on
Print_ISBN :
0-7803-8359-1
DOI :
10.1109/IJCNN.2004.1380163