• DocumentCode
    678431
  • Title

    Distributed K-Means Clustering with Low Transmission Cost

  • Author

    Coelho Naldi, Murilo ; Gabrielli Barreto Campello, Ricardo Jose

  • Author_Institution
    Dept. of Exact & Technol. Sci., Fed. Univ. of Vicosa-UFV, Paranaiba, Brazil
  • fYear
    2013
  • fDate
    19-24 Oct. 2013
  • Firstpage
    70
  • Lastpage
    75
  • Abstract
    Handling large amounts of data is one of the challenges of clustering, and it often requires large data sets to be distributed across separate repositories. However, most clustering techniques require the data to be centralized. One of them, k-means, has been ranked among the most influential data mining algorithms. Although exact distributed versions of the k-means algorithm have been proposed, the algorithm is still sensitive to the selection of the initial cluster prototypes and requires that the number of clusters be specified in advance. Additionally, distributed versions of clustering algorithms usually require multiple rounds of data transmission. This work tackles the problem of generating an approximate model for distributed clustering, based on k-means, for scenarios where the number of clusters in the distributed data is unknown and the data transmission rate is low or costly. A collection of algorithms is proposed that combines k-means clustering on each distributed subset of the data with a single round of communication. These algorithms are compared from two perspectives: a theoretical one, through asymptotic complexity analyses, and an experimental one, through a comparative evaluation of experimental results and statistical tests.
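    The general idea of combining local k-means runs with a single round of communication can be illustrated with a minimal sketch. It is not the paper's algorithm: the abstract does not give the procedures, so everything below is an assumption made for illustration. Each site is assumed to send only its centroids and cluster sizes, the central site is assumed to run a weighted k-means over those summaries, the number of clusters is fixed rather than estimated, and a deterministic farthest-first initialization is used. The names `local_summary` and `merge_summaries` are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def farthest_first(points, k):
    """Deterministic initialization (an assumption of this sketch):
    start from the first point, then repeatedly add the point that is
    farthest from the prototypes chosen so far."""
    chosen = [list(points[0])]
    while len(chosen) < k:
        nxt = max(points, key=lambda p: min(sq_dist(p, c) for c in chosen))
        chosen.append(list(nxt))
    return chosen


def weighted_kmeans(points, weights, k, iters=100):
    """Lloyd's iterations on weighted points.
    Returns (centroids, total weight assigned to each centroid)."""
    centroids = farthest_first(points, k)
    dim = len(points[0])
    totals = [0.0] * k
    for _ in range(iters):
        sums = [[0.0] * dim for _ in range(k)]
        totals = [0.0] * k
        for p, w in zip(points, weights):
            j = min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            totals[j] += w
            for d, x in enumerate(p):
                sums[j][d] += w * x
        # An empty cluster keeps its previous centroid.
        new = [
            [s / totals[j] for s in sums[j]] if totals[j] > 0 else centroids[j]
            for j in range(k)
        ]
        if new == centroids:  # assignments are stable, so stop
            break
        centroids = new
    return centroids, totals


def local_summary(data, k):
    """Runs at each site: cluster locally, return centroids and sizes only."""
    return weighted_kmeans(data, [1.0] * len(data), k)


def merge_summaries(summaries, k):
    """Runs at the central site after the single transmission round:
    cluster the received centroids, weighted by their cluster sizes."""
    points = [c for cents, _ in summaries for c in cents]
    weights = [w for _, sizes in summaries for w in sizes]
    return weighted_kmeans(points, weights, k)


# Usage: two sites, each holding part of two well-separated groups.
site_a = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0],
          [10.0, 10.0], [10.0, 11.0], [11.0, 10.0], [11.0, 11.0]]
site_b = [[0.0, 0.0], [1.0, 1.0], [10.0, 10.0], [11.0, 11.0]]

summaries = [local_summary(site_a, 2), local_summary(site_b, 2)]
global_centroids, global_sizes = merge_summaries(summaries, 2)
```

    In this sketch each site transmits k centroids and k counts instead of its raw data, which is where the reduction in transmission cost would come from. The result approximates centralized k-means rather than reproducing it exactly.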
  • Keywords
    approximation theory; computational complexity; data mining; pattern clustering; statistical testing; approximation model; asymptotic complexity analysis; cluster prototype selection; clustering algorithms; clustering techniques; data clustering; data mining algorithms; data transmission rate; distributed data; distributed k-means clustering; statistical tests; transmission cost; Approximation algorithms; Clustering algorithms; Data communication; Distributed databases; Partitioning algorithms; Sociology; Statistics; clustering; distributed data sets; k-means; low data transfer;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Intelligent Systems (BRACIS), 2013 Brazilian Conference on
  • Conference_Location
    Fortaleza
  • Type

    conf

  • DOI
    10.1109/BRACIS.2013.20
  • Filename
    6726428