Regional Information Center for Science and Technology - Scalability of efficient parallel K-Means

DocumentCode :

3404514

Title :

Scalability of efficient parallel K-Means

Author :

Pettinger, David ; Di Fatta, Giuseppe

Author_Institution :

Sch. of Syst. Eng., Univ. of Reading, Reading, UK

fYear :

2009

fDate :

9-11 Dec. 2009

Firstpage :

Lastpage :

101

Abstract :

Clustering is defined as the grouping of similar items in a set, and is an important process within the field of data mining. As the amount of data for various applications continues to increase, in terms of its size and dimensionality, it is necessary to have efficient clustering methods. A popular clustering algorithm is K-Means, which adopts a greedy approach to produce a set of K-clusters with associated centres of mass, and uses a squared error distortion measure to determine convergence. Methods for improving the efficiency of K-Means have been largely explored in two main directions. The amount of computation can be significantly reduced by adopting a more efficient data structure, notably a multi-dimensional binary search tree (KD-Tree) to store either centroids or data points. A second direction is parallel processing, where data and computation loads are distributed over many processing nodes. However, little work has been done to provide a parallel formulation of the efficient sequential techniques based on KD-Trees. Such approaches are expected to have an irregular distribution of computation load and can suffer from load imbalance. This issue has so far limited the adoption of these efficient K-Means techniques in parallel computational environments. In this work, we provide a parallel formulation for the KD-Tree based K-Means algorithm and address its load balancing issues.
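As an illustration of the baseline the abstract describes, the following is a minimal sketch of plain sequential K-Means: greedy assignment to the nearest centre of mass, with convergence judged by the squared error distortion. It is not the KD-Tree based or parallel formulation proposed in the paper. The names `k_means` and `squared_distance`, and the parameters `tol`, `max_iter` and `seed`, are assumptions made for this example.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def k_means(points, k, tol=1e-9, max_iter=100, seed=0):
    """Plain sequential K-Means (hypothetical helper, not the paper's method).

    Returns (centroids, distortion). The distortion is the sum of squared
    distances from each point to its nearest centroid, measured during the
    last assignment step.
    """
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # assumption: random initial centres
    previous = float("inf")
    distortion = previous
    for _ in range(max_iter):
        # Assignment step: each point greedily joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        distortion = 0.0
        for p in points:
            dists = [squared_distance(p, c) for c in centroids]
            j = dists.index(min(dists))
            clusters[j].append(p)
            distortion += dists[j]
        # Update step: move each centroid to the centre of mass of its cluster.
        for j, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(x) / len(members) for x in zip(*members))
        # Convergence test: stop once the distortion no longer decreases.
        if previous - distortion <= tol:
            break
        previous = distortion
    return centroids, distortion


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    centres, error = k_means(data, k=2)
    print(sorted(centres), error)
```

Every assignment step compares each point with every centroid, which is the cost that the KD-Tree techniques mentioned in the abstract aim to reduce.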

Keywords :

data mining; greedy algorithms; parallel processing; pattern clustering; tree searching; clustering; computation loads; convergence; data structure; greedy approach; load imbalance; multidimensional binary search tree; parallel computational environments; parallel K-Means; scalability; squared error distortion measure; binary search trees; clustering algorithms; clustering methods; concurrent computing; distortion measurement; distributed computing; tree data structures

fLanguage :

English

Publisher :

IEEE

Conference_Title :

E-Science Workshops, 2009 5th IEEE International Conference on

Conference_Location :

Oxford

Print_ISBN :

978-1-4244-5946-9

Type :

conf

DOI :

10.1109/ESCIW.2009.5407991

Filename :

5407991

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=3404514