Title :
Impact of GPUs Parallelism Management on Safety-Critical and HPC Applications Reliability
Author :
Rech, P. ; Pilla, Laercio L. ; Navaux, Philippe Olivier Alexandre ; Carro, Luigi
Author_Institution :
Inst. de Inf., Univ. Fed. do Rio Grande do Sul, Rio Grande, Brazil
Abstract :
Graphics Processing Units (GPUs) offer high computational power but incur a high scheduling strain to manage parallel processes, which increases the GPU cross section. The results of extensive neutron radiation experiments performed on NVIDIA GPUs confirm this hypothesis. Reducing the application Degree Of Parallelism (DOP) reduces the scheduling strain but also modifies the GPU parallelism management, including memory latency, the number of registers per thread, and processor occupancy, all of which influence the sensitivity of the parallel application. An analysis of how the overall GPU radiation sensitivity depends on the code DOP is provided, and the most reliable configuration is identified experimentally. Finally, modifying the parallelism management affects not only the GPU cross section but also the code execution time and, thus, the exposure to radiation required to complete the computation. The Mean Workload Between Failures and Mean Executions Between Failures metrics are introduced to evaluate the workload or the number of executions computed correctly by the GPU on a realistic application.
Keywords :
graphics processing units; multi-threading; parallel algorithms; processor scheduling; safety-critical software; storage management; GPU cross section; GPU parallelism management; GPU radiation sensitivity dependence; HPC application reliability; NVIDIA GPU; application degree of parallelism; code DOP; code execution time; computational power; failure metrics; graphic processing units; mean workload; memory latency; neutron radiation experiments; parallel application; parallel process management; processors occupancy; safety-critical applications; scheduling strain; thread registers number; workload evaluation; Error analysis; Instruction sets; Neutrons; Parallel processing; Reliability; Strain; GPGPUs; radiation
Conference_Title :
Dependable Systems and Networks (DSN), 2014 44th Annual IEEE/IFIP International Conference on
Conference_Location :
Atlanta, GA
DOI :
10.1109/DSN.2014.49