• DocumentCode
    2334845
  • Title
    Document indexing in text categorization
  • Author
    Zhang, Qi-Rui ; Zhang, Ling ; Dong, Shou-Bin ; Tan, Jing-Hua
  • Author_Institution
    Guangdong Key Lab. of Comput. Network, South China Univ. of Technol., Guangzhou, China
  • Volume
    6
  • fYear
    2005
  • fDate
    18-21 Aug. 2005
  • Firstpage
    3792
  • Abstract
    Addressing the characteristics of text categorization, this paper proposes an improved method of computing term weights, tfidfie, based on the traditional tfidf function used in most classifiers. Compared with the tfidf function, the tfidfie function adds an information entropy factor, H, which represents the distribution of the documents in the training set in which the term occurs. Experiments show that tfidfie outperforms tfidf. In addition, the paper analyses how the use of the information entropy factor H differs between document categorization and feature selection, and finds that both phases are necessary for text categorization. The best classification performance can be reached with up to 70% of the unique terms removed.
  • Keywords
    classification; indexing; information retrieval; text analysis; vocabulary; document indexing; feature selection; information entropy factor; term weight computing; text categorization; tfidfie function; Classification tree analysis; Computer networks; Frequency; Indexing; Information entropy; Information retrieval; Intelligent networks; Laboratories; Machine learning; Text categorization
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Machine Learning and Cybernetics, 2005. Proceedings of 2005 International Conference on
  • Conference_Location
    Guangzhou, China
  • Print_ISBN
    0-7803-9091-1
  • Type
    conf
  • DOI
    10.1109/ICMLC.2005.1527600
  • Filename
    1527600
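
The abstract describes tfidfie as tfidf extended with an information entropy factor H, but this record does not give the formula. The sketch below is therefore only one plausible reading, not the paper's definition: it assumes H is the entropy of a term's document distribution over the training categories, and that the weight is tf * idf * (1 - H / log(number of categories)). The names `entropy_factor` and `tfidfie_weights` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def entropy_factor(class_doc_counts, num_classes):
    """Return 1 - normalized entropy of a term's documents over classes.

    ASSUMPTION: this form is illustrative; the record does not state how
    the paper defines or applies its entropy factor H.
    """
    total = sum(class_doc_counts.values())
    if total == 0 or num_classes < 2:
        return 1.0
    h = 0.0
    for n in class_doc_counts.values():
        if n > 0:
            p = n / total
            h -= p * math.log(p)
    # 1.0 when the term occurs in a single class, 0.0 when spread evenly.
    return 1.0 - h / math.log(num_classes)


def tfidfie_weights(docs, labels):
    """Weight each term of each tokenized training document.

    docs   -- list of token lists
    labels -- category label of each document, same order as docs
    """
    n_docs = len(docs)
    num_classes = len(set(labels))
    df = Counter()                   # documents containing each term
    class_df = defaultdict(Counter)  # per-class document counts per term
    for tokens, label in zip(docs, labels):
        for term in set(tokens):
            df[term] += 1
            class_df[term][label] += 1

    weights = []
    for tokens in docs:
        tf = Counter(tokens)
        doc_weights = {}
        for term, count in tf.items():
            idf = math.log(n_docs / df[term])
            factor = entropy_factor(class_df[term], num_classes)
            doc_weights[term] = count * idf * factor
        weights.append(doc_weights)
    return weights
```

Under these assumptions, a term whose documents all fall in one category keeps its full tfidf weight, while a term spread evenly across categories is weighted down to zero.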