• DocumentCode
    3165886
  • Title
    A Generalization of Proximity Functions for K-Means
  • Author
    Wu, Junjie; Xiong, Hui; Chen, Jian; Zhou, Wenjun
  • Author_Institution
    Tsinghua Univ., Beijing
  • fYear
    2007
  • fDate
    28-31 Oct. 2007
  • Firstpage
    361
  • Lastpage
    370
  • Abstract
    K-means is a widely used partitional clustering method. Considerable effort has gone into finding better proximity (distance) functions for k-means. However, the common characteristics of these proximity functions remain unknown. To this end, in this paper, we show that all proximity functions that fit k-means clustering can be generalized as k-means distance, which can be derived from a differentiable convex function. A general proof of sufficient and necessary conditions for k-means distance functions is also provided. In addition, we reveal that k-means has a general uniformization effect; that is, k-means tends to produce clusters with relatively balanced cluster sizes. This uniformization effect exists regardless of the proximity function used. Finally, we have conducted extensive experiments on various real-world data sets, and the results provide evidence of the uniformization effect. We also observed that external clustering validation measures, such as entropy and variance of information (VI), have difficulty measuring clustering quality when the class sizes of the data are skewed.
  • Keywords
    convex programming; entropy; pattern clustering; differentiable convex function; information variance; k-means clustering; k-means distance functions; partitional clustering method; proximity functions; Clustering algorithms; Conference management; Data mining; Educational institutions; Electronic mail; Entropy; Information management; Size measurement; Statistical analysis; Statistical distributions
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Title
    Data Mining, 2007. ICDM 2007. Seventh IEEE International Conference on
  • Conference_Location
    Omaha, NE
  • ISSN
    1550-4786
  • Print_ISBN
    978-0-7695-3018-5
  • Type
    conf
  • DOI
    10.1109/ICDM.2007.59
  • Filename
    4470260