  • DocumentCode
    2888870
  • Title
    Effective Text Classification by a Supervised Feature Selection Approach
  • Author
    Basu, Tulika ; Murthy, C.A.
  • Author_Institution
    Machine Intell. Unit, Indian Stat. Inst., Kolkata, India
  • fYear
    2012
  • fDate
    10-10 Dec. 2012
  • Firstpage
    918
  • Lastpage
    925
  • Abstract
    The high dimensionality of data is a major challenge for effective text classification. Each document in a corpus contains much irrelevant and noisy information, which eventually reduces the efficiency of text classification. Automatic feature selection methods are therefore extremely important for handling this high dimensionality. Feature selection in text classification focuses on identifying relevant information without affecting the accuracy of the classifier. Several feature selection methods have been proposed to improve classification accuracy by reducing the original feature space. To improve the performance of text classification, a new supervised feature selection approach is proposed that develops a similarity measure between a term and a class. Every term thus receives a score based on its similarity with all the classes, and the terms are then ranked accordingly. Experimental results are presented on several TREC and Reuters data sets using the kNN classifier. Classifier performance is compared using precision, recall, F-measure and classification accuracy. The proposed term selection approach is compared with document frequency thresholding, information gain, mutual information and the chi-square statistic. The empirical studies show that the proposed method performs significantly better than the other methods.
  • Keywords
    text analysis; TREC data sets; classification accuracy; document corpus; document frequency thresholding; effective text classification; f-measure; information gain; knn classifier; mutual information; reuter data sets; supervised feature selection approach; Accuracy; Complexity theory; Equations; Frequency measurement; Mutual information; Training; Vocabulary; Feature Selection; Text Classification;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Data Mining Workshops (ICDMW), 2012 IEEE 12th International Conference on
  • Conference_Location
    Brussels
  • Print_ISBN
    978-1-4673-5164-5
  • Type
    conf
  • DOI
    10.1109/ICDMW.2012.45
  • Filename
    6406548