DocumentCode :
2933423
Title :
A message-logging protocol for multicore systems
Author :
Meneses, Esteban ; Ni, Xiang ; Kalé, Laxmikant V.
Author_Institution :
Dept. of Comput. Sci., Univ. of Illinois at Urbana-Champaign, Urbana, IL, USA
fYear :
2012
fDate :
25-28 June 2012
Firstpage :
1
Lastpage :
6
Abstract :
Although many details of an eventual Exascale machine remain unknown, we can safely make two assumptions: Exascale machines will be composed of multicore nodes, and they will experience frequent failures. The latter means that effective resilience support is imperative to make Exascale machines usable. The former opens up opportunities to explore new ways of providing resilience support. This paper examines a new fault tolerance protocol for multicore systems and has three major parts. In the first part, we show evidence that a node (and not a core) is the appropriate unit of failure: when a crash hits a machine, it usually renders a whole node unusable, and only rarely does it bring down more than one node. The second part describes a message-logging protocol that tolerates the failure of whole nodes and uses an efficient shared-memory scheme to minimize overhead. We present results on various clusters and scale the approach to 1024 cores with a stencil computation; the overhead is always lower than 4%. The third part analyzes reliability to understand how robust the protocol is when failures affect several nodes. Using an analytical framework and the frequency of multiple-node failures, we find that our approach is able to survive more than 99% of the crashes.
Keywords :
fault tolerant computing; protocols; shared memory systems; analytical framework; exascale machine; fault tolerance protocol; message-logging protocol; multicore system nodes; multiple-node failure frequency; overhead minimization; reliability analysis; resilience support; shared memory scheme; Computer crashes; Fault tolerance; Fault tolerant systems; Multicore processing; Protocols; Runtime; Supercomputers; fault tolerance; message logging; multicore systems;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Dependable Systems and Networks Workshops (DSN-W), 2012 IEEE/IFIP 42nd International Conference on
Conference_Location :
Boston, MA
Print_ISBN :
978-1-4673-2264-5
Electronic_ISBN :
978-1-4673-2265-2
Type :
conf
DOI :
10.1109/DSNW.2012.6264673
Filename :
6264673