Warped-Shield: Tolerating Hard Faults in GPGPUs

Author

Dweik, Waleed ; Abdel-Majeed, M. ; Annavaram, Murali

Author_Institution

Ming Hsieh Dept. of Electr. Eng., Univ. of Southern California, Los Angeles, CA, USA

fYear

2014

fDate

23-26 June 2014

Firstpage

431

Lastpage

442

Abstract

Graphics processing units (GPUs) are rapidly becoming the parallel accelerators of choice for general-purpose applications; GPUs used this way are termed GPGPUs. Many mission-critical and long-running scientific applications are being ported to GPGPUs, and these applications demand strong computational integrity. GPGPUs, like many other digital components, face imminent reliability threats due to technology scaling. Of particular concern are infield hard faults, which are persistent and irreversible. GPGPUs comprise dozens of streaming processors, each of which employs tens of execution units organized as single instruction multiple thread (SIMT) lanes to deliver massively parallel computational power. In this paper we exploit the massive replication of SIMT lanes to tolerate infield hard faults. First, we introduce thread shuffling to reroute threads originally mapped to faulty SIMT lanes to idle healthy lanes. Thread shuffling is insufficient when the number of healthy SIMT lanes is smaller than the number of active threads. To broaden the reach of thread shuffling, we propose dynamic warp deformation, which splits a warp into multiple sub-warps; each sub-warp uses fewer SIMT lanes, thereby providing more opportunities to avoid a faulty lane. Finally, we propose warp shuffling, which exploits the non-uniform degradation of different streaming processors by scheduling a warp to a streaming processor that requires fewer warp splits. Warp shuffling thus reduces the performance overhead associated with dynamic warp deformation. By deploying the proposed techniques, we can tolerate the worst-case scenario of up to three hard faults per cluster of four SIMT lanes with at most 36% performance degradation.
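
The three techniques in the abstract can be illustrated with a small Python sketch. It is not the paper's implementation: the function names, the boolean fault and activity maps, and the assumption that threads are rerouted only within their own four-lane cluster are all hypothetical choices made for illustration.

```python
from math import ceil

CLUSTER = 4  # lanes per cluster, following the abstract's four-lane cluster


def shuffle_cluster(active, faulty):
    """Thread shuffling plus warp deformation for ONE cluster (hypothetical).

    active[i]: the thread originally mapped to lane i is active.
    faulty[i]: lane i has a hard fault.
    Returns a list of sub-warps, each a dict {thread: healthy_lane}.
    """
    healthy = [lane for lane, bad in enumerate(faulty) if not bad]
    threads = [t for t, on in enumerate(active) if on]
    if not threads:
        return []
    if not healthy:
        raise ValueError("active threads but no healthy lane in cluster")
    # Each sub-warp holds at most len(healthy) threads; one sub-warp means
    # thread shuffling alone was enough and no split was required.
    return [dict(zip(threads[k:k + len(healthy)], healthy))
            for k in range(0, len(threads), len(healthy))]


def subwarps_needed(active, faulty):
    """Sub-warps needed for a whole warp: the worst cluster sets the count."""
    worst = 1
    for base in range(0, len(active), CLUSTER):
        n_active = sum(active[base:base + CLUSTER])
        n_healthy = CLUSTER - sum(faulty[base:base + CLUSTER])
        if n_active == 0:
            continue
        if n_healthy == 0:
            raise ValueError("active threads but no healthy lane in cluster")
        worst = max(worst, ceil(n_active / n_healthy))
    return worst


def pick_sp(active, sp_fault_maps):
    """Warp shuffling (hypothetical policy): pick the streaming processor
    whose fault map requires the fewest sub-warps for this warp."""
    return min(range(len(sp_fault_maps)),
               key=lambda s: subwarps_needed(active, sp_fault_maps[s]))
```

In this model, a cluster with three faulty lanes and all four threads active needs four sub-warps, which corresponds to the worst-case scenario the abstract mentions; a cluster with two active threads and two healthy lanes needs no split at all.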

Keywords

fault tolerant computing; graphics processing units; multi-threading; parallel processing; scheduling; GPGPUs; SIMT lanes; computational integrity; dynamic warp deformation; general purpose applications; infield hard fault tolerance; long-running scientific application; mission-critical scientific application; parallel accelerators; parallel computational power; performance overhead reduction; single instruction multiple thread lanes; streaming processors; thread rerouting; thread shuffling; warp scheduling; warp shuffling; warped-shield; Benchmark testing; Fault tolerance; Fault tolerant systems; Instruction sets; Optimized production technology; Registers; Single instruction multiple threads (SIMT); warp deformation

fLanguage

English

Publisher

IEEE

Conference_Titel

Dependable Systems and Networks (DSN), 2014 44th Annual IEEE/IFIP International Conference on

Conference_Location

Atlanta, GA

Type

conf

DOI

10.1109/DSN.2014.95

Filename

6903600

Link To Document

https://search.isc.ac/dl/search/defaultta.aspx?DTC=49&DC=244347