• DocumentCode
    2991208
  • Title

    Optimisation and parallelisation of the partitioning around medoids function in R

  • Author

    Piotrowski, Michal ; Sloan, Terence M. ; Mewissen, Muriel ; Forster, Thorsten ; Mitchell, Lawrence ; Petrou, Savvas ; Dobrzelecki, Bartosz ; Ghazal, Peter ; Trew, Arthur ; Hill, Jon

  • Author_Institution
    EPCC, Univ. of Edinburgh, Edinburgh, UK
  • fYear
    2011
  • fDate
    4-8 July 2011
  • Firstpage
    707
  • Lastpage
    713
  • Abstract
    R is a free statistical programming language commonly used for the analysis of high-throughput microarray and other data. It is currently unable to easily utilise multiprocessor architectures without substantial changes to existing R scripts. Further, working with large volumes of data often leads to slow processing and even memory allocation faults. A recent survey highlighted clustering algorithms as both computation- and data-intensive bottlenecks in post-genomic data analyses. These algorithms aim to sort numeric vectors (such as gene expression profiles) into groups by minimising vector distances within groups and maximising them between groups. This paper describes the optimisation and parallelisation of a popular clustering algorithm, partitioning around medoids (PAM), for the Simple Parallel R INTerface (SPRINT). SPRINT allows R users to exploit high performance computing systems without expert knowledge of such systems. This paper reports on a serial optimisation of the original code and a subsequent parallel implementation. The parallel implementation enables the processing of data sets that exceed the available physical memory and can yield, depending on the data set, a more than 100-fold increase in performance.
  • Keywords
    biology computing; genomics; microprocessor chips; multiprocessing systems; optimisation; parallel architectures; parallel processing; pattern clustering; statistical analysis; R scripts; clustering algorithms; data intensive bottlenecks; high throughput microarray; medoids function; memory allocation faults; multiprocessor architectures; optimisation; partitioning around medoids; partitioning parallelisation; post genomic data analyses; simple parallel R interface; statistical programming language; Algorithm design and analysis; Benchmark testing; Clustering algorithms; High performance computing; Memory management; Optimization; Partitioning algorithms; Clustering; High Performance Computing; Message Passing Interface; Microarray; R;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    High Performance Computing and Simulation (HPCS), 2011 International Conference on
  • Conference_Location
    Istanbul
  • Print_ISBN
    978-1-61284-380-3
  • Type

    conf

  • DOI
    10.1109/HPCSim.2011.5999896
  • Filename
    5999896