DocumentCode
244351
Title
Impact of GPUs Parallelism Management on Safety-Critical and HPC Applications Reliability
Author
Rech, P. ; Pilla, Laercio L. ; Navaux, Philippe Olivier Alexandre ; Carro, Luigi
Author_Institution
Inst. de Inf., Univ. Fed. do Rio Grande do Sul, Rio Grande, Brazil
fYear
2014
fDate
23-26 June 2014
Firstpage
455
Lastpage
466
Abstract
Graphics Processing Units (GPUs) offer high computational power but impose a high scheduling strain to manage parallel processes, which increases the GPU cross section. The results of extensive neutron radiation experiments performed on NVIDIA GPUs confirm this hypothesis. Reducing the application Degree Of Parallelism (DOP) reduces the scheduling strain but also modifies the GPU parallelism management, including memory latency, the number of thread registers, and processor occupancy, all of which influence the sensitivity of the parallel application. An analysis of how the overall GPU radiation sensitivity depends on the code DOP is provided, and the most reliable configuration is identified experimentally. Finally, modifying the parallelism management affects not only the GPU cross section but also the code execution time and, thus, the exposure to radiation required to complete the computation. The Mean Workload Between Failures and Mean Executions Between Failures metrics are introduced to evaluate the workload or the number of executions computed correctly by the GPU on a realistic application.
Keywords
graphics processing units; multi-threading; parallel algorithms; processor scheduling; safety-critical software; storage management; GPU cross section; GPU parallelism management; GPU radiation sensitivity dependence; HPC application reliability; NVIDIA GPU; application degree of parallelism; code DOP; code execution time; computational power; failure metrics; graphic processing units; mean workload; memory latency; neutron radiation experiments; parallel application; parallel process management; processors occupancy; safety-critical applications; scheduling strain; thread registers number; workload evaluation; Error analysis; Instruction sets; Neutrons; Parallel processing; Reliability; Strain; GPGPUs; radiation; reliability
fLanguage
English
Publisher
IEEE
Conference_Titel
Dependable Systems and Networks (DSN), 2014 44th Annual IEEE/IFIP International Conference on
Conference_Location
Atlanta, GA
Type
conf
DOI
10.1109/DSN.2014.49
Filename
6903602