• DocumentCode
    3248789
  • Title
    Feature selection for clustering - a filter solution
  • Author
    Dash, Manoranjan ; Choi, Kiseok ; Scheuermann, Peter ; Liu, Huan
  • Author_Institution
    Dept. of Electr. & Comput. Eng., Northwestern Univ., Evanston, IL, USA
  • fYear
    2002
  • fDate
    2002
  • Firstpage
    115
  • Lastpage
    122
  • Abstract
    Processing applications with a large number of dimensions has been a challenge for the KDD community. Feature selection, an effective dimensionality reduction technique, is an essential pre-processing method to remove noisy features. In the literature only a few methods have been proposed for feature selection for clustering, and almost all these methods are 'wrapper' techniques that require a clustering algorithm to evaluate candidate feature subsets. The wrapper approach is largely unsuitable in real-world applications due to its heavy reliance on clustering algorithms that require parameters such as the number of clusters, and the lack of suitable clustering criteria to evaluate clustering in different subspaces. In this paper we propose a 'filter' method that is independent of any clustering algorithm. The proposed method is based on the observation that data with clusters has a very different point-to-point distance histogram to that of data without clusters. By exploiting this we propose an entropy measure that is low if data has distinct clusters and high if it does not. The entropy measure is suitable for selecting the most important subset of features because it is invariant with the number of dimensions, and is affected only by the quality of clustering. Extensive performance evaluation over synthetic, benchmark, and real datasets shows its effectiveness.
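The abstract's central idea, that clustered data yields a distinctive point-to-point distance histogram and therefore a low entropy score, can be illustrated with a minimal sketch. The abstract does not give the paper's exact formula, so the code below assumes a simple stand-in: pairwise Euclidean distances are scaled by the largest distance and scored with the binary entropy function, which is near zero for distances close to 0 or 1 and largest near 0.5. The names `binary_entropy` and `distance_entropy` and the toy datasets are hypothetical, chosen only for illustration.

```python
import math
import random
from itertools import combinations


def binary_entropy(d):
    """Entropy in bits of a normalized distance d in [0, 1]; 0*log(0) is taken as 0."""
    if d <= 0.0 or d >= 1.0:
        return 0.0
    return -(d * math.log2(d) + (1.0 - d) * math.log2(1.0 - d))


def distance_entropy(points):
    """Mean binary entropy over all pairwise distances, scaled by the largest one.

    ASSUMPTION: this is a stand-in for the entropy measure described in the
    abstract, not the paper's actual formula.
    """
    dists = [math.dist(p, q) for p, q in combinations(points, 2)]
    d_max = max(dists, default=0.0)
    if d_max == 0.0:
        return 0.0  # fewer than two points, or all points identical
    return sum(binary_entropy(d / d_max) for d in dists) / len(dists)


rng = random.Random(0)  # fixed seed so the toy data is reproducible

# Two tight, well-separated clusters: distances are either very small or near the maximum.
clustered = [
    (cx + rng.uniform(-0.1, 0.1), cy + rng.uniform(-0.1, 0.1))
    for cx, cy in [(0.0, 0.0), (10.0, 10.0)]
    for _ in range(20)
]

# No cluster structure: distances spread across the whole range.
uniform = [(rng.uniform(0.0, 10.0), rng.uniform(0.0, 10.0)) for _ in range(40)]

print("clustered:", distance_entropy(clustered))
print("uniform:  ", distance_entropy(uniform))
```

Under this assumed measure, a filter-style selection would score each candidate feature subset by projecting the data onto it and preferring subsets with lower entropy, with no clustering algorithm involved.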
  • Keywords
    data mining; entropy; feature extraction; pattern clustering; clustering; dimensionality reduction technique; entropy measure; feature selection; filter method; knowledge discovery in databases; noisy feature removal; point-to-point distance histogram; pre-processing method; Clustering algorithms; Degradation; Entropy; Filters; Histograms; Noise reduction; Unsupervised learning;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Data Mining, 2002. ICDM 2002. Proceedings. 2002 IEEE International Conference on
  • Print_ISBN
    0-7695-1754-4
  • Type
    conf
  • DOI
    10.1109/ICDM.2002.1183893
  • Filename
    1183893