Multiple Parallel MapReduce k-Means Clustering with Validation and Selection

Author

Dearo Garcia, Kemilly ; Coelho Naldi, Murilo

Author_Institution

Dept. of Exact & Technol. Sci., Univ. Fed. de Viçosa - UFV, Rio Paranaíba, Brazil

fYear

2014

fDate

18-22 Oct. 2014

Firstpage

432

Lastpage

437

Abstract

Handling large volumes of data is one of the main challenges for clustering, because huge data sets must be distributed and managed across separate repositories. New distributed systems have been designed to scale from a single server to thousands of machines. The MapReduce framework allows a job to be divided and its results combined seamlessly. k-means is one of the few clustering algorithms that satisfy the MapReduce constraints, but it requires the number of clusters to be specified in advance and is sensitive to the initialization of the clusters. In this work, we propose a MapReduce clustering algorithm that executes multiple parallel runs of k-means with different initializations and numbers of clusters. Additionally, a MapReduce version of a relative cluster validity index is implemented and used to find the best result. The proposed algorithm is experimentally compared with the Apache Mahout project's MapReduce implementation of k-means. Statistical tests applied to the results indicate that the proposed algorithm can outperform Mahout's implementation when multiple k-means partitions are required.
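The idea in the abstract can be illustrated with a minimal in-memory sketch. It is not the authors' implementation and does not run on Hadoop: it only simulates the map and reduce phases in plain Python. All function names are hypothetical, the empty-cluster rule is an assumption, and the simplified silhouette is used here as an assumed stand-in for the relative validity index, which the abstract does not name.

```python
import random
from collections import defaultdict

def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def map_point(point, runs):
    """Map phase: for every k-means run, emit ((run_id, cluster_id), (point, 1))."""
    for run_id, centroids in runs.items():
        cluster_id = min(range(len(centroids)),
                         key=lambda c: sq_dist(point, centroids[c]))
        yield (run_id, cluster_id), (point, 1)

def reduce_centroid(values):
    """Reduce phase: average all points that share one (run_id, cluster_id) key."""
    total, count = None, 0
    for point, n in values:
        total = list(point) if total is None else [t + p for t, p in zip(total, point)]
        count += n
    return tuple(t / count for t in total)

def iterate(points, runs):
    """One simulated MapReduce pass that updates every run at once."""
    groups = defaultdict(list)
    for p in points:
        for key, value in map_point(p, runs):
            groups[key].append(value)
    new_runs = {}
    for run_id, centroids in runs.items():
        # Assumption: an empty cluster simply keeps its previous centroid.
        new_runs[run_id] = [
            reduce_centroid(groups[(run_id, c)]) if (run_id, c) in groups else centroids[c]
            for c in range(len(centroids))
        ]
    return new_runs

def simplified_silhouette(points, centroids):
    """Assumed validity index: mean of (b - a) / max(a, b), where a and b are the
    distances to the nearest and second-nearest centroid. Requires k >= 2."""
    total = 0.0
    for p in points:
        d = sorted(sq_dist(p, c) ** 0.5 for c in centroids)
        a, b = d[0], d[1]  # a <= b, so max(a, b) == b
        total += (b - a) / b if b > 0 else 0.0
    return total / len(points)

def multiple_kmeans(points, ks, inits_per_k, iterations=10, seed=0):
    """Run several k-means configurations together and keep the best-scoring one."""
    rng = random.Random(seed)
    runs = {(k, i): rng.sample(points, k) for k in ks for i in range(inits_per_k)}
    for _ in range(iterations):
        runs = iterate(points, runs)
    best = max(runs, key=lambda r: simplified_silhouette(points, runs[r]))
    return best, runs[best]

if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    (k, init), centroids = multiple_kmeans(data, ks=[2, 3], inits_per_k=3)
    print("selected k:", k, "centroids:", centroids)
```

Keying each emitted pair by both the run and the cluster lets a single pass over the data update every candidate partition, which is the property that makes running many initializations and values of k together attractive in this setting.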

Keywords

data handling; parallel programming; pattern clustering; statistical testing; Apache Mahout Project MapReduce implementation; MapReduce clustering algorithm; MapReduce constraint; cluster initialization; cluster number; cluster relative validity index; data repositories; data selection; data set distribution; data set management; data validation; distributed systems; multiple k-means partitioning; multiple parallel MapReduce k-means clustering; parallel k-means runs; statistical tests; Big data; Clustering algorithms; Data structures; Indexes; Parallel processing; Partitioning algorithms; Vectors; Cluster Selection; Clustering Validation; MapReduce Clustering; k-means

fLanguage

English

Publisher

IEEE

Conference_Titel

Intelligent Systems (BRACIS), 2014 Brazilian Conference on

Conference_Location

São Paulo

Type

conf

DOI

10.1109/BRACIS.2014.83

Filename

6984869