DocumentCode :
3359290
Title :
Workload-dependent relative fault sensitivity and error contribution factor of GPU onchip memory structures
Author :
Shah, Rohan ; Minsu Choi ; Byunghyun Jang
Author_Institution :
Dept. of Electr. & Comput. Eng., Missouri Univ. of Sci. & Technol., Rolla, MO, USA
fYear :
2013
fDate :
15-18 July 2013
Firstpage :
271
Lastpage :
278
Abstract :
The GPU (Graphics Processing Unit) is emerging as an efficient and scalable accelerator for data-parallel workloads in applications ranging from tablet PCs to HPC (High Performance Computing) mainframes. Unlike traditional 3D graphics rendering, general-purpose compute applications demand stringent assurance of reliability. Therefore, single error tolerance schemes such as SECDED (Single Error Correcting, Double Error Detecting) codes are being rapidly introduced into high-end GPUs targeting high-performance general-purpose computing. However, the relative fault sensitivity and error contribution of critical on-chip memory structures such as the active mask stack (AMS), register file (REG) and local memory (MEM) are yet to be studied. Also, the implications of single error tolerance for various GPGPU (General Purpose computing on GPU) workloads have not been quantitatively analyzed to reveal its relative cost/fault-tolerance efficiency. To address these issues, a novel Monte Carlo simulation framework has been explored in this work to enumerate and analyze well-converged fault injection data. Instead of estimating the AVF (Architectural Vulnerability Factor) of each structure individually, we have injected faults into the whole memory (AMS, REG and MEM combined) in a structure-oblivious fashion. Then, we further categorized and analyzed each structure's relative fault sensitivity and error contribution factor. Finally, we have studied the implications of single error tolerance for the memory structures by considering eight different possible ECC (Error Correction Code) profiles. Results show that the relative fault sensitivity and error contribution of REG are the highest among the considered memory structures; therefore, ECC protection of REG is the most critical and cost-effective.
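The structure-oblivious injection described in the abstract can be illustrated with a toy Monte Carlo sketch. Everything in it is an assumption made for illustration: the structure sizes, the per-structure error probabilities, and the working definitions of relative fault sensitivity (errors per fault landing in a structure) and error contribution factor (a structure's share of all errors) are not taken from the paper, and a fixed probability stands in for running a real GPGPU workload on a simulator. The eight ECC profiles are modeled as the 2^3 choices of which of AMS, REG and MEM are protected, with every single fault in a protected structure assumed corrected.

```python
import itertools
import random

# Hypothetical sizes (in bits) and probabilities that one bit flip in a
# structure corrupts the program output. Illustrative values only.
SIZES = {"AMS": 1_000, "REG": 6_000, "MEM": 3_000}
P_ERROR = {"AMS": 0.30, "REG": 0.50, "MEM": 0.20}


def inject(rng):
    """Flip one bit chosen uniformly over the combined memory.

    Returns (structure_name, caused_error). The target is chosen without
    regard to structure boundaries; the structure is identified afterwards.
    """
    bit = rng.randrange(sum(SIZES.values()))
    for name, size in SIZES.items():
        if bit < size:
            return name, rng.random() < P_ERROR[name]
        bit -= size
    raise AssertionError("unreachable")


def run_trials(n, seed=0):
    """Run n independent single-fault injections and tally them per structure."""
    rng = random.Random(seed)
    hits = {s: 0 for s in SIZES}
    errors = {s: 0 for s in SIZES}
    for _ in range(n):
        name, caused_error = inject(rng)
        hits[name] += 1
        errors[name] += caused_error
    return hits, errors


def summarize(hits, errors):
    """Assumed metrics: errors per injected fault, and share of all errors."""
    total_errors = sum(errors.values())
    sensitivity = {s: errors[s] / hits[s] if hits[s] else 0.0 for s in hits}
    contribution = {
        s: errors[s] / total_errors if total_errors else 0.0 for s in hits
    }
    return sensitivity, contribution


def ecc_profiles(errors):
    """Residual error count for each of the 2^3 = 8 protection choices."""
    names = list(errors)
    residual = {}
    for mask in itertools.product([False, True], repeat=len(names)):
        protected = frozenset(n for n, m in zip(names, mask) if m)
        residual[protected] = sum(
            e for n, e in errors.items() if n not in protected
        )
    return residual


if __name__ == "__main__":
    hits, errors = run_trials(20_000, seed=1)
    sensitivity, contribution = summarize(hits, errors)
    print("sensitivity :", sensitivity)
    print("contribution:", contribution)
    for protected, left in sorted(ecc_profiles(errors).items(), key=lambda kv: kv[1]):
        print(sorted(protected), left)
```

With these made-up parameters REG dominates the error contribution simply because it is both the largest and the most sensitive structure in the toy model; the paper's own conclusion rests on measured workload behavior, not on assumed probabilities like these.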
Keywords :
Monte Carlo methods; error correction codes; graphics processing units; storage management chips; AVF estimation; ECC protection; GPGPU; GPU onchip memory structures; HPC mainframes; MEM; Monte Carlo simulation framework; REG; SECDED; active mask stack; architectural vulnerability factor; data-parallel workloads; error contribution factor; error correction code protection; general purpose computing on GPU; graphics processing unit; high performance computing mainframes; local memory; on-chip memory structures; register file; relative fault sensitivity; scalable accelerator; single error correcting double error detecting code; single error tolerance schemes; workload-dependent relative fault sensitivity; Error correction codes; Graphics processing units; Hardware; Monte Carlo methods; Periodic structures; Registers; Sensitivity;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS XIII), 2013 International Conference on
Conference_Location :
Agios Konstantinos
Type :
conf
DOI :
10.1109/SAMOS.2013.6621134
Filename :
6621134