• DocumentCode
    2990007
  • Title

    P-means, a parallel clustering algorithm for a heterogeneous multi-processor environment

  • Author

    Foina, Aislan G. ; Planas, Judit ; Badia, Rosa M. ; Ramirez-Fernandez, Francisco Javier

  • Author_Institution
    Barcelona Supercomput. Center, Spanish Nat. Res. Council, Barcelona, Spain
  • fYear
    2011
  • fDate
    4-8 July 2011
  • Firstpage
    239
  • Lastpage
    248
  • Abstract
    G-means is a data mining clustering algorithm based on k-means, used to find the number of Gaussian distributions and their centers inside a multi-dimensional dataset. This paper presents the performance gain obtained from the development of a parallel G-means algorithm for a heterogeneous multi-processor environment using the StarSs framework, called here P-means. The P-means execution was divided into 6 well-defined steps, and each step was analyzed to create a hierarchical task structure that parallelizes the execution, enabling it to exploit the hierarchy and heterogeneity of the Cell BE blades and other heterogeneous architectures. The algorithm implementation was also adapted to perform sequential timing measurements to evaluate Amdahl's law, to compare the theoretical calculation with the measured execution times, and to introduce parallel computation by using the StarSs framework. The algorithm was executed using a 30-cluster dataset containing 600 thousand points of 60 dimensions in different hardware configurations in order to compare its execution time and speedup, and it showed an overall speedup of more than 18 times. A successful experiment with real data demonstrated the usefulness of the algorithm.
  • Keywords
    Gaussian distribution; data mining; multiprocessing systems; parallel algorithms; pattern clustering; Amdahl's law; Cell BE blades; Gaussian distribution; P-means; StarSs framework; data mining clustering algorithm; heterogeneous architecture; heterogeneous multiprocessor environment; hierarchical task structure; k-means; multidimensional dataset; parallel G-means algorithm; parallel clustering algorithm; sequential timing measures; Clustering algorithms; Computer architecture; Data mining; Gaussian distribution; Microprocessors; Program processors; Programming; Clustering; Data mining; Heterogeneous system; Parallel programming; Software performance
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    High Performance Computing and Simulation (HPCS), 2011 International Conference on
  • Conference_Location
    Istanbul
  • Print_ISBN
    978-1-61284-380-3
  • Type

    conf

  • DOI
    10.1109/HPCSim.2011.5999830
  • Filename
    5999830