• DocumentCode
    387564
  • Title

    Relative term-frequency based feature selection for text categorization

  • Author

    Yang, Stewart M. ; Wu, Xiao-Bin ; Deng, Zhi-Hong ; Zhang, Ming ; Yang, Dong-Qing

  • Author_Institution
    Dept. of Comput. Sci. & Technol., Peking Univ., Beijing, China
  • Volume
    3
  • fYear
    2002
  • fDate
    2002
  • Firstpage
    1432
  • Abstract
    Automatic feature selection methods such as document frequency, information gain, and mutual information are commonly applied in the preprocessing stage of text categorization to reduce the originally high feature dimension to a manageable level, while also reducing noise to improve precision. Generally they assess a specific term by counting its occurrences within individual categories or in the entire corpus, where "occurring in a document" is simply defined as occurring at least once. A major drawback of this measure is that, for a single document, it may count a recurrent term the same as a rare term, although the former is obviously more informative and should be less likely to be removed. In this paper we propose a possible approach to overcome this problem, which adjusts the occurrence count according to the relative term frequency, thus stressing the recurrent words in each document. While it can be applied to all feature selection methods, we implemented it on several of them and observed notable improvements in performance.
  • Keywords
    classification; feature extraction; information retrieval; learning (artificial intelligence); automatic feature selection; classification; document frequency; information gain; machine learning; mutual information; nearest neighbor classifier; relative term frequency; text categorization; Computer science; Feature extraction; Frequency measurement; Gain measurement; Machine learning; Mutual information; Neural networks; Noise level; Noise reduction; Text categorization;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Machine Learning and Cybernetics, 2002. Proceedings. 2002 International Conference on
  • Print_ISBN
    0-7803-7508-4
  • Type

    conf

  • DOI
    10.1109/ICMLC.2002.1167443
  • Filename
    1167443
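
The idea summarized in the abstract can be illustrated with a small sketch. This record does not give the paper's actual formula, so the weighting below (a term's count in a document divided by the count of that document's most frequent term) is only an assumption chosen for illustration, and the function names `document_frequency` and `weighted_document_frequency` are hypothetical. The sketch contrasts the usual binary "occurs at least once" count with a count adjusted by relative term frequency.

```python
from collections import Counter


def document_frequency(docs):
    """Binary count: every document containing a term contributes exactly 1."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # set() collapses repeats within one document
    return df


def weighted_document_frequency(docs):
    """Assumed weighting (not taken from the paper): each document
    contributes tf / max_tf, a value in (0, 1], instead of a flat 1."""
    wdf = Counter()
    for tokens in docs:
        tf = Counter(tokens)
        if not tf:
            continue  # skip empty documents
        max_tf = max(tf.values())
        for term, count in tf.items():
            wdf[term] += count / max_tf
    return wdf


if __name__ == "__main__":
    # Toy corpus, invented for this example only.
    docs = [
        ["stock", "market", "stock", "stock", "report"],
        ["stock", "weather", "weather", "weather", "weather"],
    ]
    print(document_frequency(docs))
    print(weighted_document_frequency(docs))
```

Under the binary count, "stock" scores 2 because it appears in both toy documents, regardless of how often. Under the assumed weighting, the second document, where "stock" appears only once next to four occurrences of "weather", contributes just a quarter of a full occurrence, so terms that recur within a document carry more weight in the selection score.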