• DocumentCode
    2089059
  • Title

    Sustainable GPU Computing at Scale

  • Author

    Shi, Justin Y. ; Taifi, Moussa ; Khreishah, Abdallah ; Wu, Jie

  • Author_Institution
    Dept. of Comput. & Inf. Sci., Temple Univ., Philadelphia, PA, USA
  • fYear
    2011
  • fDate
    24-26 Aug. 2011
  • Firstpage
    263
  • Lastpage
    272
  • Abstract
    General purpose GPU (GPGPU) computing has produced the fastest running supercomputers in the world. For continued sustainable progress, GPU computing at scale also needs to address two open issues: a) how to increase an application's mean time between failures (MTBF) as we increase a supercomputer's component count, and b) how to minimize unnecessary energy consumption. Since energy consumption is determined by the number of components used, we consider that a sustainable high performance computing (HPC) application is one that can deliver better performance and reliability at the same time when computing or communication components are added. This paper reports a two-tier semantic statistical multiplexing framework for sustainable HPC at scale. The idea is to leverage the power of statistical multiplexing to tame the nagging HPC scalability challenges. We include the theoretical model, sustainability analysis and computational experiments with automatic system level multiple CPU/GPU failure containment. Our results show that, assuming a threefold slowdown of the statistical multiplexing layer, for an application using 1024 processors with 35% checkpoint overhead, the two-tier framework will produce sustained time and energy savings for MTBF less than 6 hours. With 5% checkpoint overhead, an MTBF of 1.5 hours would be the break-even point. These results suggest the practical feasibility of the proposed two-tier framework.
  • Keywords
    computer graphic equipment; coprocessors; parallel machines; statistical multiplexing; automatic system level multiple CPU failure containment; automatic system level multiple GPU failure containment; energy consumption; general purpose GPU computing; mean time between failures; statistical multiplexing layer; supercomputers; sustainability analysis; sustainable GPU computing; sustainable high performance computing application; two-tier semantic statistical multiplexing framework; Graphics processing unit; Multiplexing; Parallel processing; Peer to peer computing; Scalability; Semantics; Switches; Data parallel processing; Fault tolerant GPU computing; Semantic statistical multiplexing; Tuple switching network;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Computational Science and Engineering (CSE), 2011 IEEE 14th International Conference on
  • Conference_Location
    Dalian, Liaoning
  • Print_ISBN
    978-1-4577-0974-6
  • Type

    conf

  • DOI
    10.1109/CSE.2011.55
  • Filename
    6062884