• DocumentCode
    3227919
  • Title

    Fine Text Categorization: Using Very Aggressive Feature Selection to Cope with Mass Duplicated Features

  • Author

    DAI, Liuling ; HU, Jinwu ; Wu, ShiKun

  • Author_Institution
    Sch. of Comput. Sci., Beijing Inst. of Technol., Beijing
  • Volume
    2
  • fYear
    2008
  • fDate
    20-22 Oct. 2008
  • Firstpage
    984
  • Lastpage
    988
  • Abstract
    Text categorization is a key issue in text mining. Although there are many studies on this problem, most of them focus on classification into rough categories, where clearly distinct features differentiate one category from the others. Very little research has addressed the fine text categorization (FTC) problem, which is characterized by many features duplicated across different categories. In this paper, we first point out that traditional feature selection levels cannot be directly used to cope with this problem. To improve performance, we perform very aggressive feature selection (VAFS) by first removing the common features arbitrarily, and then selecting features with a modified CHI-square statistic in a very aggressive manner. In the end, only very few features are used to learn the underlying concepts of the categories. Experimental results show that VAFS improves performance notably and that rule-based algorithms are more suitable than vector-based algorithms.
  • Keywords
    data mining; knowledge based systems; statistical analysis; text analysis; aggressive feature selection; fine text categorization; mass duplicated features; rough categories; rule based algorithms; text mining; Automation; Information retrieval; Information technology; Laboratories; Machine learning algorithms; Partial response channels; Support vector machine classification; Support vector machines; Text categorization; Text mining; SVM; feature selection; kNN; rough set
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Intelligent Computation Technology and Automation (ICICTA), 2008 International Conference on
  • Conference_Location
    Hunan
  • Print_ISBN
    978-0-7695-3357-5
  • Type

    conf

  • DOI
    10.1109/ICICTA.2008.90
  • Filename
    4659910
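
The two-stage selection outlined in the abstract (drop features shared across categories, then keep only a handful of features ranked by a chi-square statistic) might be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the paper's modified CHI-square statistic is not given in this record, so the standard chi-square statistic is used in its place; the rule for what counts as a "common" feature is also assumed. The names `chi_square`, `select_features`, `max_category_fraction`, and `top_k`, as well as the toy documents, are hypothetical.

```python
from collections import defaultdict


def chi_square(a, b, c, d):
    """Standard chi-square statistic for a term/category 2x2 table.

    a: docs in the category that contain the term
    b: docs outside the category that contain the term
    c: docs in the category that lack the term
    d: docs outside the category that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0
    return n * (a * d - c * b) ** 2 / denom


def select_features(docs, max_category_fraction=0.5, top_k=2):
    """docs: list of (set_of_terms, category). Returns {category: [terms]}."""
    categories = sorted({cat for _, cat in docs})
    n_docs = len(docs)
    cat_size = defaultdict(int)
    term_cat_df = defaultdict(lambda: defaultdict(int))  # term -> category -> doc count
    for terms, cat in docs:
        cat_size[cat] += 1
        for term in terms:
            term_cat_df[term][cat] += 1

    # Stage 1 (assumed criterion): drop terms that occur in too many categories.
    limit = max_category_fraction * len(categories)
    kept = [t for t, per_cat in term_cat_df.items() if len(per_cat) <= limit]

    # Stage 2: per category, keep only the top_k positively associated terms.
    selected = {}
    for cat in categories:
        scored = []
        for term in kept:
            a = term_cat_df[term].get(cat, 0)
            b = sum(term_cat_df[term].values()) - a
            c = cat_size[cat] - a
            d = n_docs - a - b - c
            if a * d > b * c:  # term is more frequent inside the category
                scored.append((chi_square(a, b, c, d), term))
        scored.sort(key=lambda s: (-s[0], s[1]))
        selected[cat] = [term for _, term in scored[:top_k]]
    return selected


# Toy "fine" categories that share most of their vocabulary (hypothetical data).
docs = [
    ({"screen", "battery", "keyboard"}, "laptop"),
    ({"screen", "keyboard", "hinge"}, "laptop"),
    ({"screen", "battery", "stylus"}, "tablet"),
    ({"screen", "stylus"}, "tablet"),
    ({"screen", "battery", "sim"}, "phone"),
    ({"screen", "sim"}, "phone"),
]
print(select_features(docs))
```

In this toy data, `screen` and `battery` occur in every category and are discarded in the first stage, so each category is left with only its few distinguishing terms. A small per-category term set of this kind is the sort of input a rule-based learner could use directly.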