Regional Information Center for Science and Technology (RICEST) - Efficient Distributed Data Clustering on Spark

DocumentCode :

3678401

Title :

Efficient Distributed Data Clustering on Spark

Author :

Jia Li;Dongsheng Li;Yiming Zhang

Author_Institution :

Nat. Lab. for Parallel &

Year :

2015

Firstpage :

504

Lastpage :

505

Abstract :

Data clustering is usually time-consuming because, by default, it must iteratively aggregate and process large volumes of data. Approximate aggregation based on samples provides fast, quality-assured results. In this paper, we propose applying approximation techniques to data clustering to obtain a trade-off between clustering efficiency and result quality, along with online accuracy estimation. The proposed method is based on bootstrap trials. We implemented this method as an Intelligent Bootstrap Library (IBL) on Spark to support efficient data clustering. Intensive evaluations show that IBL can provide a 2x speed-up over the state-of-the-art solution with the same error bound.
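
The abstract's idea of sample-based aggregation with bootstrap accuracy estimation can be illustrated with a minimal sketch. This is not the IBL API, which the record does not describe, and it runs in plain Python rather than on Spark. The function names `bootstrap_mean_error` and `approximate_centroid`, the parameter defaults, and the use of the standard deviation across bootstrap trials as the error estimate are all assumptions made for illustration.

```python
import random
import statistics


def bootstrap_mean_error(sample, trials=200, seed=0):
    """Return the sample mean and a bootstrap estimate of its standard error.

    Hypothetical helper for illustration; not part of IBL.
    """
    rng = random.Random(seed)
    n = len(sample)
    estimates = []
    for _ in range(trials):
        # One bootstrap trial: resample with replacement, same size as the sample.
        resample = [sample[rng.randrange(n)] for _ in range(n)]
        estimates.append(statistics.fmean(resample))
    # The spread of the trial means approximates the error of the sample mean.
    return statistics.fmean(sample), statistics.stdev(estimates)


def approximate_centroid(points, sample_size, trials=200, seed=0):
    """Estimate a cluster centroid from a random sample of its points.

    Returns (centroid, errors), where errors[d] is the bootstrap standard
    error of coordinate d. Hypothetical helper for illustration.
    """
    rng = random.Random(seed)
    sample = rng.sample(points, sample_size)  # sample without replacement
    centroid, errors = [], []
    for d in range(len(points[0])):
        column = [p[d] for p in sample]
        mean, err = bootstrap_mean_error(column, trials, seed + d + 1)
        centroid.append(mean)
        errors.append(err)
    return centroid, errors


if __name__ == "__main__":
    # Synthetic 2-D points standing in for the members of one cluster.
    points = [(float(i % 10), float(i % 7)) for i in range(10000)]
    centroid, errors = approximate_centroid(points, sample_size=500)
    print("approximate centroid:", centroid)
    print("bootstrap standard errors:", errors)
```

In a sketch like this, a caller could grow `sample_size` until every entry of `errors` falls below a chosen bound, which is one way to trade clustering efficiency against result quality.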

Keywords :

"Sparks","Accuracy","Data mining","Estimation error","Distributed databases","Approximation methods"

Publisher :

IEEE

Conference_Title :

Cluster Computing (CLUSTER), 2015 IEEE International Conference on

Type :

conf

DOI :

10.1109/CLUSTER.2015.84

Filename :

7307631

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=3678401