Regional Information Center for Science and Technology - Fault detection and tolerance mechanisms for future 1000 core systems

DocumentCode :

1815133

Title :

Fault detection and tolerance mechanisms for future 1000 core systems

Author :

Fechner, B. ; Garbade, A. ; Weis, Sebastian ; Ungerer, Theo

Author_Institution :

Dept. of Comput. Sci., Univ. of Augsburg, Augsburg, Germany

fYear :

2013

fDate :

1-5 July 2013

Firstpage :

552

Lastpage :

554

Abstract :

The enormous growth in integration density makes it possible to build processors with more and more cores on a single die, but also makes them orders of magnitude more vulnerable to faults caused by voltage fluctuation, radiation, process variations [4], etc. Since this trend will continue, fault-tolerance mechanisms must be an essential part of such future systems if computations are to be carried out reliably. Chip manufacturers have already taken measures to handle faults in current multi-core processors, such as error-correcting codes for buses, caches, etc. With a huge number of cores, common strategies like dual modular and triple modular redundant processing [5] are possible along with massively parallel computing. Threaded dataflow execution models are one way to exploit the parallelism of future 1000-core systems. Current GPU architectures reflect that [3]. The side-effect-free execution of threads within the dataflow execution model can be used not only to provide massively parallel computational capacity, but also to enable simple and efficient rollback mechanisms [16]. In this paper, we describe fault detection and tolerance mechanisms investigated within the TERAFLUX EC project [17], which offers a solution to exploit the massive parallelism offered by dataflow architectures at all abstraction levels.

Keywords :

data flow computing; fault tolerant computing; multi-threading; multiprocessing systems; parallel architectures; system recovery; 1000-core systems; GPU architectures; TERAFLUX EC project; chip manufacturers; dataflow architectures; dual modular redundant processing; fault detection; fault handling; fault tolerance mechanism; massive parallel computational capacity; multicore processors; rollback mechanisms; side-effect free thread execution; threaded dataflow execution models; triple modular redundant processing; Computer architecture; Fault detection; Frequency modulation; Instruction sets; Message systems; Monitoring; Reliability;

fLanguage :

English

Publisher :

IEEE

Conference_Titel :

High Performance Computing and Simulation (HPCS), 2013 International Conference on

Conference_Location :

Helsinki

Print_ISBN :

978-1-4799-0836-3

Type :

conf

DOI :

10.1109/HPCSim.2013.6641467

Filename :

6641467

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=1815133