DocumentCode :
39749
Title :
The Role of Hubness in Clustering High-Dimensional Data
Author :
Tomasev, Nenad ; Radovanovic, Milos ; Mladenic, Dunja ; Ivanovic, Mirjana
Author_Institution :
Artificial Intell. Lab., Jozef Stefan Inst., Ljubljana, Slovenia
Volume :
26
Issue :
3
fYear :
2014
fDate :
March 2014
Firstpage :
739
Lastpage :
751
Abstract :
High-dimensional data arise naturally in many domains, and have regularly presented a great challenge for traditional data mining techniques, both in terms of effectiveness and efficiency. Clustering becomes difficult due to the increasing sparsity of such data, as well as the increasing difficulty in distinguishing distances between data points. In this paper, we take a novel perspective on the problem of clustering high-dimensional data. Instead of attempting to avoid the curse of dimensionality by observing a lower dimensional feature subspace, we embrace dimensionality by taking advantage of inherently high-dimensional phenomena. More specifically, we show that hubness, i.e., the tendency of high-dimensional data to contain points (hubs) that frequently occur in k-nearest-neighbor lists of other points, can be successfully exploited in clustering. We validate our hypothesis by demonstrating that hubness is a good measure of point centrality within a high-dimensional data cluster, and by proposing several hubness-based clustering algorithms, showing that major hubs can be used effectively as cluster prototypes or as guides during the search for centroid-based cluster configurations. Experimental results demonstrate good performance of our algorithms in multiple settings, particularly in the presence of large quantities of noise. The proposed methods are tailored mostly for detecting approximately hyperspherical clusters and need to be extended to properly handle clusters of arbitrary shapes.
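The hubness notion defined in the abstract (how often a point occurs in the k-nearest-neighbor lists of other points) can be illustrated with a minimal sketch. This is not the authors' clustering algorithm; it only computes the k-occurrence count per point. The function name `k_occurrences`, the use of Euclidean distance, and the synthetic Gaussian data are assumptions made for illustration.

```python
import numpy as np

def k_occurrences(X, k):
    """Count, for each row of X, how many other points include it
    among their k nearest neighbors (Euclidean distance assumed)."""
    n = X.shape[0]
    # Pairwise squared Euclidean distances via the expansion
    # ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    # A point is never counted as its own neighbor.
    np.fill_diagonal(d2, np.inf)
    # Indices of the k smallest distances in each row (unordered).
    knn = np.argpartition(d2, k - 1, axis=1)[:, :k]
    # Tally how often each index appears across all neighbor lists.
    return np.bincount(knn.ravel(), minlength=n)

# Synthetic example: 500 points drawn from a 100-dimensional Gaussian.
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 100))
scores = k_occurrences(X, k=10)

# Points with the highest counts are the candidate hubs.
hubs = np.argsort(scores)[::-1][:5]
```

Under these assumptions, the points in `hubs` are the ones the abstract describes as candidate cluster prototypes; a hubness-based method would still need its own rule for selecting and updating them.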
Keywords :
data mining; pattern clustering; centroid-based cluster configuration; cluster prototypes; data mining techniques; data sparsity; high-dimensional data cluster; high-dimensional data clustering; high-dimensional phenomena; hubness role; hubness-based clustering algorithm; hyperspherical cluster detection; lower dimensional feature subspace; point centrality; Approximation algorithms; Clustering algorithms; Correlation; Educational institutions; Gaussian distribution; Partitioning algorithms; Prototypes; Clustering; curse of dimensionality; hubs; nearest neighbors;
fLanguage :
English
Journal_Title :
Knowledge and Data Engineering, IEEE Transactions on
Publisher :
IEEE
ISSN :
1041-4347
Type :
jour
DOI :
10.1109/TKDE.2013.25
Filename :
6427743