DocumentCode
1227199
Title
Stemming via Distribution-Based Word Segregation for Classification and Retrieval
Author
Bhamidipati, Narayan L. ; Pal, Sankar K.
Author_Institution
Machine Intelligence Unit, Indian Stat. Inst., Kolkata
Volume
37
Issue
2
fYear
2007
fDate
4/1/2007 12:00:00 AM
Firstpage
350
Lastpage
360
Abstract
A novel corpus-based method for stemmer refinement, which can provide improvement in both classification and retrieval, is described. The method models the given words as generated from a multinomial distribution over the topics available in the corpus and includes a procedure like sequential hypothesis testing that enables grouping together distributionally similar words. The system can refine any stemmer, and its strength can be controlled with parameters that reflect the amount of tolerance to be allowed in computing the similarity between the distributions of two words. Although obtaining the morphological roots of the given words is not the primary objective, the algorithm automatically does that to some extent. Despite a huge reduction in dictionary size, classification accuracies are seen to improve significantly when the proposed system is applied to some existing stemmers for classifying 20 Newsgroups and WebKB data. The refinements obtained are also suitable for cross-corpus stemming. Regarding retrieval, its superiority is extensively demonstrated with respect to four existing methods.
Keywords
dictionaries; information retrieval; natural languages; pattern classification; text analysis; WebKB data; corpus-based method; cross-corpus stemming; distribution-based word segregation; multinomial distribution; newsgroups data; sequential hypothesis testing; stemmer refinement; text classification; text retrieval; Automatic control; Control systems; Dictionaries; Distributed computing; Information retrieval; Prototypes; Sequential analysis; Testing; Text categorization; Text mining; Precision and recall; prototype selection; stemming; text categorization; Algorithms; Artificial Intelligence; Information Storage and Retrieval; Natural Language Processing; Pattern Recognition, Automated; Vocabulary, Controlled;
fLanguage
English
Journal_Title
Systems, Man, and Cybernetics, Part B: Cybernetics, IEEE Transactions on
Publisher
IEEE
ISSN
1083-4419
Type
jour
DOI
10.1109/TSMCB.2006.885307
Filename
4126275
Link To Document