Title :
Stemming via Distribution-Based Word Segregation for Classification and Retrieval
Author :
Bhamidipati, Narayan L. ; Pal, Sankar K.
Author_Institution :
Machine Intelligence Unit, Indian Statistical Institute, Kolkata
fDate :
4/1/2007
Abstract :
A novel corpus-based method for stemmer refinement, which can provide improvement in both classification and retrieval, is described. The method models the given words as generated from a multinomial distribution over the topics available in the corpus and includes a procedure, like sequential hypothesis testing, that enables grouping together distributionally similar words. The system can refine any stemmer, and its strength can be controlled with parameters that reflect the amount of tolerance to be allowed in computing the similarity between the distributions of two words. Although obtaining the morphological roots of the given words is not the primary objective, the algorithm automatically does that to some extent. Despite a huge reduction in dictionary size, classification accuracies are seen to improve significantly when the proposed system is applied to some existing stemmers for classifying 20 Newsgroups and WebKB data. The refinements obtained are also suitable for cross-corpus stemming. Regarding retrieval, its superiority is extensively demonstrated with respect to four existing methods.
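The idea of grouping distributionally similar words can be illustrated with a minimal sketch. This is not the authors' procedure: it substitutes a plain Pearson chi-square homogeneity statistic for the sequential test mentioned in the abstract, and the names `chi_square_stat`, `similar`, `refine_group`, the greedy grouping strategy, and the `threshold` tolerance parameter are all assumptions made for illustration. Each word is represented by its per-topic occurrence counts, and words from one stem class are merged only while their topic distributions remain compatible.

```python
def chi_square_stat(a, b):
    """Pearson chi-square statistic for homogeneity of two per-topic count vectors."""
    na, nb = sum(a), sum(b)
    if na == 0 or nb == 0:
        # No occurrences for one word: no evidence of a difference (assumed convention).
        return 0.0
    total = na + nb
    stat = 0.0
    for x, y in zip(a, b):
        col = x + y
        if col == 0:
            continue  # topic unused by both words contributes nothing
        ea = na * col / total  # expected count for word a in this topic
        eb = nb * col / total  # expected count for word b in this topic
        stat += (x - ea) ** 2 / ea + (y - eb) ** 2 / eb
    return stat


def similar(a, b, threshold):
    """Treat two words as distributionally similar if the statistic stays within the tolerance."""
    return chi_square_stat(a, b) <= threshold


def refine_group(word_counts, threshold):
    """Greedily split one stem class into groups of distributionally similar words.

    word_counts maps each word to its per-topic counts. Each word joins the first
    group whose pooled counts it matches; otherwise it starts a new group.
    """
    groups = []  # list of (member words, pooled per-topic counts)
    for word, counts in word_counts.items():
        for members, pooled in groups:
            if similar(counts, pooled, threshold):
                members.append(word)
                for i, c in enumerate(counts):
                    pooled[i] += c
                break
        else:
            groups.append(([word], list(counts)))
    return [members for members, _ in groups]


# Hypothetical counts over three topics; 5.991 is the 5% chi-square critical value for 2 degrees of freedom.
stem_class = {"play": [30, 5, 5], "played": [28, 6, 6], "player": [5, 30, 5]}
print(refine_group(stem_class, threshold=5.991))
```

Raising `threshold` tolerates larger differences between distributions and so merges more words, which mirrors the abstract's remark that the strength of the refinement is controlled by tolerance parameters.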
Keywords :
dictionaries; information retrieval; natural languages; pattern classification; text analysis; WebKB data; corpus-based method; cross-corpus stemming; distribution-based word segregation; multinomial distribution; newsgroups data; sequential hypothesis testing; stemmer refinement; text classification; text retrieval; Automatic control; Control systems; Dictionaries; Distributed computing; Information retrieval; Prototypes; Sequential analysis; Testing; Text categorization; Text mining; Precision and recall; prototype selection; stemming; text categorization; Algorithms; Artificial Intelligence; Information Storage and Retrieval; Natural Language Processing; Pattern Recognition, Automated; Vocabulary, Controlled;
Journal_Title :
Systems, Man, and Cybernetics, Part B: Cybernetics, IEEE Transactions on
DOI :
10.1109/TSMCB.2006.885307