Title :
The functional and performance tolerance of GPUs to permanent faults in registers
Author :
Tselonis, Sotiris ; Dimitsas, Vasilis ; Gizopoulos, D.
Author_Institution :
Comput. Archit. Lab., Univ. of Athens, Athens, Greece
Abstract :
Massively parallel many-core Graphics Processing Unit (GPU) architectures offer significant performance speedup over contemporary multicore CPUs in workloads with thread-level parallelism. For this reason, general-purpose computing using GPUs (GPGPU) is a rapidly expanding research direction in different contexts. Unlike graphics processing, GPGPU computing requires reliable operation in the presence of hardware faults, whose occurrence probabilities in current and forthcoming advanced manufacturing technologies will be significant. In this paper, we focus on the tolerance of GPUs to permanent faults in their most critical storage elements: register files. By performing a comprehensive fault injection campaign on a cycle-accurate GPGPU architectural simulator, we first evaluate and classify the behavior of NVIDIA GPU CUDA kernels in the presence of permanent faults in registers. Moreover, we analyze the performance tolerance of GPUs when they operate in degraded mode (fewer hardware resources, less thread-level parallelism) due to the presence of multiple permanent faults in the registers of their streaming multiprocessors. Our findings confirm the intuitively expected tolerance of these architectures to faults and also quantify it in different configurations and modes.
Keywords :
fault tolerant computing; file organisation; graphics processing units; multi-threading; multiprocessing systems; parallel architectures; performance evaluation; GPGPU computing; NVIDIA GPU CUDA kernel behavior classification; advanced manufacturing technologies; comprehensive fault injection campaign; computing using GPU; critical storage elements; cycle-accurate GPGPU architectural simulator; general-purpose computing; hardware fault occurrence probabilities; hardware resources; massively parallel many-core graphic processing unit architecture; performance speedup; permanent fault tolerance analysis; register files; streaming multiprocessors; thread-level parallelism; Testing; GPU reliability; fault tolerance; permanent faults
Conference_Title :
On-Line Testing Symposium (IOLTS), 2013 IEEE 19th International
Conference_Location :
Chania
DOI :
10.1109/IOLTS.2013.6604089