• DocumentCode
    827764
  • Title

    Distributional Features for Text Categorization

  • Author

    Xue, Xiao-Bing ; Zhou, Zhi-Hua

  • Author_Institution
    Nanjing Univ., Nanjing
  • Volume
    21
  • Issue
    3
  • fYear
    2009
  • fDate
    3/1/2009 12:00:00 AM
  • Firstpage
    428
  • Lastpage
    442
  • Abstract
    Text categorization is the task of assigning predefined categories to natural language text. With the widely used "bag of words" representation, previous research usually assigns each word a value indicating whether the word appears in the document concerned or how frequently it appears. Although these values are useful for text categorization, they do not fully express the abundant information contained in the document. This paper explores the effect of other types of values, which express the distribution of a word in the document. These novel values assigned to a word are called distributional features, which include the compactness of the appearances of the word and the position of the first appearance of the word. The proposed distributional features are exploited by a tf-idf style equation, and different features are combined using ensemble learning techniques. Experiments show that the distributional features are useful for text categorization. Compared with using the traditional term frequency values alone, including the distributional features requires only a little additional cost, while the categorization performance can be significantly improved. Further analysis shows that the distributional features are especially useful when documents are long and the writing style is casual.
  • Keywords
    learning (artificial intelligence); natural languages; text analysis; ensemble learning technique; natural language text; predefined category assignment; text categorization distributional feature; Data mining; Text mining; Modeling structured, textual and multimedia data
  • fLanguage
    English
  • Journal_Title
    Knowledge and Data Engineering, IEEE Transactions on
  • Publisher
    IEEE
  • ISSN
    1041-4347
  • Type

    jour

  • DOI
    10.1109/TKDE.2008.166
  • Filename
    4589210
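
The abstract names two kinds of distributional features (compactness of a word's appearances and the position of its first appearance) without giving formulas. The sketch below is only an illustration of the idea, not the paper's definitions: it assumes the document is split into sentence-like parts, and the function names `split_parts` and `distributional_features`, the normalization by the number of parts, and the two compactness proxies are all hypothetical choices made for this example.

```python
import re


def split_parts(text):
    """Split text into sentence-like parts (a simplifying assumption)."""
    parts = [p.strip() for p in re.split(r"[.!?]+", text)]
    return [p for p in parts if p]


def distributional_features(text, word):
    """Return illustrative distributional features of `word` in `text`.

    All definitions here are assumptions for illustration only.
    Returns None when the word does not occur.
    """
    word = word.lower()
    parts = split_parts(text.lower())
    # Indices of the parts in which the word occurs as a whitespace token.
    hits = [i for i, part in enumerate(parts) if word in part.split()]
    if not hits:
        return None
    n = len(parts)
    return {
        # Traditional value: how often the word appears.
        "term_frequency": sum(part.split().count(word) for part in parts),
        # Position of the first appearance, normalized to [0, 1).
        "first_appearance": hits[0] / n,
        # Compactness proxy 1: number of parts containing the word.
        "parts_with_word": len(hits),
        # Compactness proxy 2: normalized span between first and last appearance.
        "first_last_distance": (hits[-1] - hits[0]) / n,
    }


if __name__ == "__main__":
    doc = "Corn prices rose. Wheat fell. Traders sold corn. Rain came."
    print(distributional_features(doc, "corn"))
```

Two documents with the same term frequency for a word can differ in these values, which is the extra signal the abstract refers to. The tokenizer here is deliberately naive (whitespace tokens, so commas stay attached to words); a real pipeline would substitute its own tokenization.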