Regional Information Center for Science and Technology - A Succinct Distributive Big Data Clustering Algorithm Based on Local-Remote Coordination

Abstract:

Mining global patterns from big data distributed across many remote locations is a challenging task, since transmitting big data from different remote data servers to a central server is prohibitively expensive. In this paper, we present a succinct distributive big data clustering algorithm based on local-remote coordination (DBDC-LRC) that aims to reduce the cost of big data transmission while maintaining an acceptable overall clustering accuracy. The algorithm is divided into three phases. In the first phase, the idea of the Canopy algorithm is improved for the search for representative points, under the clustering assumption that the decision boundary should lie in a low-density region; the controllable thresholds are optimized during this search. Noting that in data mining a hyperellipsoid is more adaptable than a hypercube in shaping unknown data, we employ the Mahalanobis distance rather than the Euclidean distance to determine the representative points in the different remote data servers. Because only a limited number of representative points, instead of all the remote data, are transmitted to the central server for clustering, the transmission cost is reduced significantly. In the second phase, a weighted clustering method is used to mine the global patterns from the representative points gathered from the various remote data servers. In the third phase, the mined global patterns are sent back to the original remote servers, and the related data are labeled with the same patterns according to their nearby representative points. In this phase, a Bayesian method is used to resolve the conflicts that arise when one point is covered by several representative points in its neighborhood. Experiments show that DBDC-LRC is highly suitable for mining patterns from distributive big data, and that the advantages of this approach include low cost, high accuracy, high robustness, and good expansibility.
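The first two phases can be illustrated with a minimal sketch. This is not the authors' implementation. The function names, the single fixed threshold `t2`, and the use of weighted k-means as the "weighted clustering method" are all assumptions made for illustration. The abstract does not specify the threshold optimization or the Bayesian conflict resolution of the third phase, so both are omitted here.

```python
import numpy as np


def mahalanobis_representatives(X, t2):
    """Canopy-style pass over one remote server's data (hypothetical helper).

    Picks representative points and counts how many points each one covers.
    t2 is an assumed fixed Mahalanobis-distance threshold; the paper
    optimizes its thresholds, which this sketch does not attempt.
    """
    cov = np.atleast_2d(np.cov(X, rowvar=False))
    vi = np.linalg.pinv(cov)  # pseudo-inverse tolerates a singular covariance
    remaining = np.arange(len(X))
    reps, weights = [], []
    while remaining.size > 0:
        center = X[remaining[0]]
        diff = X[remaining] - center
        # Squared Mahalanobis distance of every remaining point to the center.
        sq = np.einsum("ij,jk,ik->i", diff, vi, diff)
        d = np.sqrt(np.maximum(sq, 0.0))
        covered = d <= t2  # the center itself is always covered (d == 0)
        reps.append(center)
        weights.append(int(covered.sum()))
        remaining = remaining[~covered]
    return np.array(reps), np.array(weights)


def weighted_kmeans(points, weights, k, iters=50, seed=0):
    """Weighted k-means on the central server (an assumed stand-in for the
    paper's weighted clustering method). Each representative pulls the
    cluster center in proportion to the number of points it covers."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)].astype(float)
    for _ in range(iters):
        d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            mask = labels == j
            if mask.any():
                centers[j] = np.average(points[mask], axis=0, weights=weights[mask])
    # Final assignment against the updated centers.
    d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    return d.argmin(axis=1), centers


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    # Two simulated remote servers, each holding local data.
    servers = [rng.normal(0, 0.5, size=(200, 2)), rng.normal(10, 0.5, size=(200, 2))]
    # Phase 1: each server sends only its representatives and their weights.
    sent = [mahalanobis_representatives(X, t2=1.0) for X in servers]
    reps = np.vstack([r for r, _ in sent])
    w = np.concatenate([c for _, c in sent])
    # Phase 2: the central server clusters the gathered representatives.
    labels, centers = weighted_kmeans(reps, w, k=2)
    print(len(reps), "representatives sent instead of", sum(len(X) for X in servers), "points")
```

In this sketch the covariance is estimated separately on each server, so the hyperellipsoid that defines "nearby" follows the shape of the local data. A third-phase step would send `labels` back with the representatives and assign each local point the label of a covering representative.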