• DocumentCode
    419612
  • Title
    Pruning the vocabulary for better context recognition
  • Author
    Madsen, R.E. ; Sigurdsson, S. ; Hansen, L.K. ; Larsen, J.
  • Author_Institution
    Technical University of Denmark
  • Volume
    2
  • fYear
    2004
  • fDate
    26 Aug. 2004
  • Firstpage
    483
  • Lastpage
    488
  • Abstract
    Language-independent 'bag-of-words' representations are surprisingly effective for text classification. The representation is high dimensional, though, and contains many words that are non-consistent for text categorization. These non-consistent words reduce the generalization performance of subsequent classifiers, e.g., through ill-posed principal component transformations. In this communication our aim is to study the effect of removing the least relevant words from the bag-of-words representation. We consider a new approach, using neural network based sensitivity maps and information gain to determine term relevancy when pruning the vocabularies. With reduced vocabularies, documents are classified using a latent semantic indexing representation and a probabilistic neural network classifier. Reducing the bag-of-words vocabularies by 90%-98%, we find consistent classification improvement on two mid-size data sets. We also study the applicability of information gain and sensitivity maps for automated keyword generation.
  • Keywords
    Databases; Humans; Indexing; Internet; Large scale integration; Learning systems; Machine learning; Neural networks; Text categorization; Vocabulary;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Pattern Recognition, 2004. ICPR 2004. Proceedings of the 17th International Conference on
  • Conference_Location
    Cambridge
  • ISSN
    1051-4651
  • Print_ISBN
    0-7695-2128-2
  • Type
    conf
  • DOI
    10.1109/ICPR.2004.1334270
  • Filename
    1334270
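
The abstract names information gain as one of two criteria for ranking term relevancy before pruning. The following is a minimal sketch of that one idea, not the authors' implementation: it assumes binary term presence per document, uses the textbook definition of information gain (class entropy minus class entropy conditioned on term presence), and leaves out the neural network sensitivity maps, the latent semantic indexing step, and the classifier. The function names and the `keep_fraction` parameter are hypothetical.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy (in bits) of a collection of class counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)


def information_gain(docs, labels):
    """Score each term by how much its presence reduces class entropy.

    docs   -- list of sets of terms (binary bag-of-words, an assumption)
    labels -- list of class labels, one per document
    """
    n = len(docs)
    h_class = entropy(Counter(labels).values())
    vocab = set().union(*docs)
    gains = {}
    for term in vocab:
        with_term = Counter(l for d, l in zip(docs, labels) if term in d)
        without_term = Counter(l for d, l in zip(docs, labels) if term not in d)
        n_with = sum(with_term.values())
        n_without = n - n_with
        # Class entropy conditioned on whether the term is present.
        h_cond = 0.0
        if n_with:
            h_cond += n_with / n * entropy(with_term.values())
        if n_without:
            h_cond += n_without / n * entropy(without_term.values())
        gains[term] = h_class - h_cond
    return gains


def prune_vocabulary(docs, labels, keep_fraction):
    """Keep the top `keep_fraction` of terms ranked by information gain."""
    gains = information_gain(docs, labels)
    k = max(1, round(keep_fraction * len(gains)))
    # Ties are broken alphabetically so the result is deterministic.
    ranked = sorted(gains, key=lambda t: (-gains[t], t))
    return set(ranked[:k])
```

Under these assumptions, a keep_fraction between 0.02 and 0.10 would correspond to the 90%-98% vocabulary reduction mentioned in the abstract. A term that appears in every document carries no class information and receives a gain of zero, so it is among the first to be pruned.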