• DocumentCode
    1227199
  • Title

    Stemming via Distribution-Based Word Segregation for Classification and Retrieval

  • Author

    Bhamidipati, Narayan L. ; Pal, Sankar K.

  • Author_Institution
    Machine Intelligence Unit, Indian Statistical Institute, Kolkata
  • Volume
    37
  • Issue
    2
  • fYear
    2007
  • fDate
    4/1/2007 12:00:00 AM
  • Firstpage
    350
  • Lastpage
    360
  • Abstract
    A novel corpus-based method for stemmer refinement, which can provide improvement in both classification and retrieval, is described. The method models the given words as generated from a multinomial distribution over the topics available in the corpus and includes a procedure like sequential hypothesis testing that enables grouping together distributionally similar words. The system can refine any stemmer, and its strength can be controlled with parameters that reflect the amount of tolerance to be allowed in computing the similarity between the distributions of two words. Although obtaining the morphological roots of the given words is not the primary objective, the algorithm automatically does that to some extent. Despite a huge reduction in dictionary size, classification accuracies are seen to improve significantly when the proposed system is applied on some existing stemmers for classifying 20 Newsgroups and WebKB data. The refinements obtained are also suitable for cross-corpus stemming. Regarding retrieval, its superiority is extensively demonstrated with respect to four existing methods.
  • Keywords
    dictionaries; information retrieval; natural languages; pattern classification; text analysis; WebKB data; corpus-based method; cross-corpus stemming; distribution-based word segregation; multinomial distribution; newsgroups data; sequential hypothesis testing; stemmer refinement; text classification; text retrieval; Automatic control; Control systems; Dictionaries; Distributed computing; Information retrieval; Prototypes; Sequential analysis; Testing; Text categorization; Text mining; Precision and recall; prototype selection; stemming; text categorization; Algorithms; Artificial Intelligence; Information Storage and Retrieval; Natural Language Processing; Pattern Recognition, Automated; Vocabulary, Controlled
  • fLanguage
    English
  • Journal_Title
    Systems, Man, and Cybernetics, Part B: Cybernetics, IEEE Transactions on
  • Publisher
    IEEE
  • ISSN
    1083-4419
  • Type

    jour

  • DOI
    10.1109/TSMCB.2006.885307
  • Filename
    4126275