• DocumentCode
    189247
  • Title

    Multiple Parallel MapReduce k-Means Clustering with Validation and Selection

  • Author

    Dearo Garcia, Kemilly ; Coelho Naldi, Murilo

  • Author_Institution
    Dept. of Exact & Technol. Sci., Univ. Fed. de Vicosa - UFV, Rio Paranaíba, Brazil
  • fYear
    2014
  • fDate
    18-22 Oct. 2014
  • Firstpage
    432
  • Lastpage
    437
  • Abstract
    Dealing with large amounts of data is one of the challenges for clustering, as it creates the need to distribute and manage huge data sets across separate repositories. New distributed systems have been designed to scale up from a single server to thousands of machines. The MapReduce framework allows a job to be divided and its results combined seamlessly. k-means is one of the few clustering algorithms that satisfy the MapReduce constraints, but it requires the number of clusters to be specified in advance and is sensitive to their initialization. In this work, we propose a MapReduce clustering algorithm that executes multiple parallel runs of k-means with different initializations and numbers of clusters. Additionally, a MapReduce version of a cluster relative validity index is implemented and used to find the best result. The proposed algorithm is experimentally compared with the Apache Mahout Project's MapReduce implementation of k-means. Statistical tests applied to the results indicate that the proposed algorithm can outperform Mahout's implementation when multiple k-means partitions are required.
  • Keywords
    data handling; parallel programming; pattern clustering; statistical testing; Apache Mahout Project MapReduce implementation; MapReduce clustering algorithm; MapReduce constraint; cluster initialization; cluster number; cluster relative validity index; data repositories; data selection; data set distribution; data set management; data validation; distributed systems; multiple k-means partitioning; multiple parallel MapReduce k-mean clustering; parallel k-means runs; statistical tests; Big data; Clustering algorithms; Data structures; Indexes; Parallel processing; Partitioning algorithms; Vectors; Cluster Selection; Clustering Validation; MapReduce Clustering; k-means;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Intelligent Systems (BRACIS), 2014 Brazilian Conference on
  • Conference_Location
    Sao Paulo
  • Type

    conf

  • DOI
    10.1109/BRACIS.2014.83
  • Filename
    6984869