DocumentCode :
3678401
Title :
Efficient Distributed Data Clustering on Spark
Author :
Jia Li;Dongsheng Li;Yiming Zhang
Author_Institution :
Nat. Lab. for Parallel &
fYear :
2015
Firstpage :
504
Lastpage :
505
Abstract :
Data clustering is usually time-consuming because, by default, it must iteratively aggregate and process large volumes of data. Approximate aggregation based on samples provides fast, quality-assured results. In this paper, we propose applying approximation techniques to data clustering to trade off clustering efficiency against result quality, along with online accuracy estimation. The proposed method is based on bootstrap trials. We implemented this method as an Intelligent Bootstrap Library (IBL) on Spark to support efficient data clustering. Intensive evaluations show that IBL can provide a 2x speed-up over the state-of-the-art solution with the same error bound.
Keywords :
"Sparks","Accuracy","Data mining","Estimation error","Distributed databases","Approximation methods"
Publisher :
IEEE
Conference_Title :
Cluster Computing (CLUSTER), 2015 IEEE International Conference on
Type :
conf
DOI :
10.1109/CLUSTER.2015.84
Filename :
7307631