DocumentCode :
1815133
Title :
Fault detection and tolerance mechanisms for future 1000 core systems
Author :
Fechner, B. ; Garbade, A. ; Weis, Sebastian ; Ungerer, Theo
Author_Institution :
Dept. of Comput. Sci., Univ. of Augsburg, Augsburg, Germany
fYear :
2013
fDate :
1-5 July 2013
Firstpage :
552
Lastpage :
554
Abstract :
The enormous growth in integration density makes it possible to build processors with more and more cores on a single die, but it also makes them orders of magnitude more vulnerable to faults caused by voltage fluctuation, radiation, process variations [4], and similar effects. Since this trend will continue, fault-tolerance mechanisms must be an essential part of such future systems if computations are to be carried out reliably. Chip manufacturers have already taken measures to handle faults in current multi-core processors, such as error-correcting codes for buses, caches, and other structures. With a huge number of cores, common strategies such as dual modular and triple modular redundant processing [5] become possible alongside massively parallel computing. Threaded dataflow execution models are one way to exploit the parallelism of future 1000-core systems; current GPU architectures reflect this [3]. The side-effect-free execution of threads within the dataflow execution model can be used not only to provide massively parallel computational capacity, but also to enable simple and efficient rollback mechanisms [16]. In this paper, we describe fault detection and tolerance mechanisms investigated within the TERAFLUX EC project [17], which offers a solution to exploit the massive parallelism offered by dataflow architectures at all abstraction levels.
Keywords :
data flow computing; fault tolerant computing; multi-threading; multiprocessing systems; parallel architectures; system recovery; 1000-core systems; GPU architectures; TERAFLUX EC project; chip manufacturers; dataflow architectures; dual modular redundant processing; fault detection; fault handling; fault tolerance mechanism; massive parallel computational capacity; multicore processors; rollback mechanisms; side-effect free thread execution; threaded dataflow execution models; triple modular redundant processing; Computer architecture; Fault detection; Frequency modulation; Instruction sets; Message systems; Monitoring; Reliability;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
High Performance Computing and Simulation (HPCS), 2013 International Conference on
Conference_Location :
Helsinki
Print_ISBN :
978-1-4799-0836-3
Type :
conf
DOI :
10.1109/HPCSim.2013.6641467
Filename :
6641467