• DocumentCode
    843597
  • Title

    Multitype features coselection for Web document clustering

  • Author

    Huang, Sheng ; Chen, Zheng ; Yu, Yong ; Ma, Wei-Ying

  • Author_Institution
    Dept. of Comput. Sci. & Eng., Shanghai Jiao Tong Univ., China
  • Volume
    18
  • Issue
    4
  • fYear
    2006
  • fDate
    4/1/2006 12:00:00 AM
  • Firstpage
    448
  • Lastpage
    459
  • Abstract
    Feature selection has been widely applied in text categorization and clustering. Compared to unsupervised selection, supervised feature selection is more successful in filtering out noise in most cases. However, due to a lack of label information, clustering can hardly exploit supervised selection. Some studies have proposed solving this problem with a "pseudoclass." As empirical results show, this method is sensitive to selection criteria and data sets. In this paper, we propose a novel feature coselection method for Web document clustering, called multitype features coselection for clustering (MFCC). MFCC uses intermediate clustering results in one type of feature space to help the selection in other types of feature spaces. Our experiments show that, for most selection criteria, MFCC effectively reduces the noise introduced by "pseudoclass" and further improves clustering performance.
  • Keywords
    Internet; classification; data mining; document handling; feature extraction; learning (artificial intelligence); pattern clustering; text analysis; Web document clustering; Web mining; multitype features coselection; supervised feature selection; text categorization; text clustering; unsupervised feature selection; Clustering algorithms; Data mining; Filtering; Information theory; Machine learning; Mel frequency cepstral coefficient; Noise reduction; Text categorization; Text mining; Uniform resource locators; Web mining; clustering; feature evaluation and selection
  • fLanguage
    English
  • Journal_Title
    Knowledge and Data Engineering, IEEE Transactions on
  • Publisher
    IEEE
  • ISSN
    1041-4347
  • Type

    jour

  • DOI
    10.1109/TKDE.2006.1599384
  • Filename
    1599384
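
The coselection idea summarized in the abstract can be illustrated with a minimal sketch. It is not the paper's actual procedure: the k-means routine, the chi-square criterion, the way scores are summed across spaces, the function names (`kmeans`, `chi_square`, `coselect`), and the toy "text" and "url" matrices are all assumptions made for illustration. Each feature space is clustered on its own, and the resulting cluster labels serve as pseudoclasses for supervised-style feature ranking in the other space.

```python
def kmeans(rows, k, iters=10):
    """Tiny k-means with deterministic, evenly spaced initial centers."""
    n = len(rows)
    centers = [list(rows[i * n // k]) for i in range(k)]
    labels = [0] * n
    for _ in range(iters):
        # Assign every row to its nearest center (squared Euclidean distance).
        labels = [
            min(range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(row, centers[c])))
            for row in rows
        ]
        for c in range(k):
            members = [row for row, lab in zip(rows, labels) if lab == c]
            if members:  # keep the old center if a cluster empties out
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def chi_square(column, labels, k):
    """Max over pseudoclasses of the chi-square statistic for term presence."""
    n = len(column)
    best = 0.0
    for cls in range(k):
        a = sum(1 for v, lab in zip(column, labels) if v > 0 and lab == cls)
        b = sum(1 for v, lab in zip(column, labels) if v > 0 and lab != cls)
        c = sum(1 for v, lab in zip(column, labels) if v == 0 and lab == cls)
        d = n - a - b - c
        denom = (a + c) * (b + d) * (a + b) * (c + d)
        if denom:
            best = max(best, n * (a * d - c * b) ** 2 / denom)
    return best


def coselect(spaces, k, keep, rounds=2):
    """spaces maps a feature-space name to a document-by-feature matrix."""
    selected = {name: list(range(len(m[0]))) for name, m in spaces.items()}
    labels = {}
    for _ in range(rounds):
        # 1. Cluster each space independently on its currently selected features.
        for name, m in spaces.items():
            projected = [[row[j] for j in selected[name]] for row in m]
            labels[name] = kmeans(projected, k)
        # 2. Rank every feature of a space against pseudoclasses from the OTHER spaces.
        for name, m in spaces.items():
            scores = []
            for j in range(len(m[0])):
                column = [row[j] for row in m]
                score = sum(chi_square(column, labels[other], k)
                            for other in spaces if other != name)
                scores.append((score, j))
            top = sorted(scores, key=lambda s: -s[0])[:keep]
            selected[name] = sorted(j for _, j in top)
    return selected, labels


# Hypothetical toy data: six documents, two topics, one noise feature per space.
text = [[1, 0, 1], [1, 0, 0], [1, 0, 0], [0, 1, 1], [0, 1, 0], [0, 1, 0]]
url = [[1, 0, 0], [1, 0, 1], [1, 0, 0], [0, 1, 0], [0, 1, 1], [0, 1, 0]]
selected, labels = coselect({"text": text, "url": url}, k=2, keep=2)
print(selected, labels)
```

In this toy setup the third feature of each space occurs equally often in both topics, so it scores zero against the other space's pseudoclasses and is dropped. Real Web data would call for sparse matrices, a sturdier clustering routine, and whichever selection criterion the experiment is meant to compare.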