Author :
Dong, Liangyu ; Xu, Dongping ; Liu, Zhenzhen ; Wang, Shasha
Abstract :
Clustering is the process of grouping a set of physical or abstract objects into classes of similar objects. When the abstract objects are represented appropriately in a vector space, the similarity among objects is equivalent to the similarity among the corresponding vectors. Hence, problems such as the clustering of limited data and clustering accuracy and efficiency can be addressed by calculating the similarity among vectors. As research on clustering algorithms for limited data objects has advanced and been refined, these algorithms have been applied in fields across commerce, industry, daily life, and national defense. As these applications pursue higher efficiency, the amount of data grows from limited to massive, and the scale of the clustering task grows accordingly. Traditional serial implementations of clustering algorithms therefore face a severe challenge in meeting the goals of clustering. The emergence of the Hadoop cloud computing platform sheds light on the computation of massive data clustering. Nonetheless, under these new circumstances, issues such as the efficiency and accuracy of the clustering calculation remain a focus for information specialists. This paper proposes a parallel K-means clustering algorithm based on the Hadoop platform and the MapReduce programming model, which aims to improve on the traditional serial K-means clustering algorithm and which uses the Canopy algorithm to replace the random selection of initial cluster centers in K-means. The experimental results show that the improved algorithm reduces the time complexity. Moreover, the accuracy of the results and the execution efficiency each increase by 40%.
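The idea described in the abstract, Canopy-based seeding followed by K-means iterations split into a map step and a reduce step, can be illustrated with a small single-process sketch. This is not the authors' Hadoop implementation: the function names (`canopy_centers`, `map_assign`, `reduce_mean`, `kmeans`), the Euclidean distance, and the use of only the tight Canopy threshold `t2` are assumptions made for illustration, and the map and reduce steps are simulated with plain Python loops rather than run on a cluster.

```python
import math
from collections import defaultdict


def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def canopy_centers(points, t2):
    """Simplified Canopy pass that returns only the canopy centers.

    Assumption: only the tight threshold t2 is used, because the loose
    threshold of the full Canopy algorithm affects canopy membership,
    not which points become centers.
    """
    remaining = list(points)
    centers = []
    while remaining:
        center = remaining.pop(0)
        centers.append(center)
        # Points within t2 of the new center can no longer become centers.
        remaining = [p for p in remaining if dist(p, center) > t2]
    return centers


def map_assign(point, centers):
    """Map step: emit (index of nearest center, point)."""
    idx = min(range(len(centers)), key=lambda i: dist(point, centers[i]))
    return idx, point


def reduce_mean(points):
    """Reduce step: the new center is the coordinate-wise mean."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans(points, centers, max_iter=20, tol=1e-9):
    """Iterate map and reduce until the centers stop moving."""
    for _ in range(max_iter):
        groups = defaultdict(list)
        for p in points:
            idx, pt = map_assign(p, centers)
            groups[idx].append(pt)
        # Keep the old center if no point was assigned to it.
        new_centers = [
            reduce_mean(groups[i]) if groups[i] else centers[i]
            for i in range(len(centers))
        ]
        shift = max(dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:
            break
    return centers


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    seeds = canopy_centers(data, t2=3.0)
    print(seeds)
    print(kmeans(data, seeds))
```

In this sketch the Canopy pass determines both the number of clusters and their starting positions, so K-means no longer depends on a random draw of initial centers. In a MapReduce setting, `map_assign` would run independently on each data split and `reduce_mean` once per center key, which is what makes the iteration parallelizable.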