DocumentCode :
2515954
Title :
Fault-tolerant communication runtime support for data-centric programming models
Author :
Vishnu, Abhinav ; Van Dam, Huub ; De Jong, Wibe ; Balaji, Pavan ; Song, Seunghyun
Author_Institution :
Pacific Northwest Nat. Lab., Richland, WA, USA
fYear :
2010
fDate :
19-22 Dec. 2010
Firstpage :
1
Lastpage :
9
Abstract :
The largest supercomputers in the world today consist of hundreds of thousands of processing cores and many more other hardware components. At such scales, hardware faults are commonplace, necessitating fault-resilient software systems. While different fault-resilient models are available, most focus on allowing the computational processes to survive faults. On the other hand, we have recently started investigating fault resilience techniques for data-centric programming models such as the partitioned global address space (PGAS) models. The primary difference in data-centric models is the decoupling of computation and data locality. That is, data placement is decoupled from the executing processes, allowing us to view process failure (a physical node hosting a process is dead) separately from data failure (a physical node hosting data is dead). In this paper, we take a first step toward data-centric fault resilience by designing and implementing a fault-resilient, one-sided communication runtime framework using Global Arrays and its communication system, ARMCI. The framework consists of a fault-resilient process manager; a low-overhead, network-assisted remote-node fault detection module; non-data-moving collective communication primitives; and failure semantics and error codes for one-sided communication runtime systems. Our performance evaluation indicates that the framework incurs little overhead compared to state-of-the-art designs and provides a fundamental framework of fault resiliency for PGAS models.
Keywords :
fault tolerant computing; mainframes; parallel machines; parallel programming; system recovery; ARMCI; PGAS model; core processor; data locality; data-centric fault resilience model; data-centric programming model; error code; failure semantic; fault tolerant communication runtime support; fault-resilient software system; global array; hardware fault; network-assisted remote-node fault detection module; nondata-moving collective communication; partitioned global address space model; supercomputer; Computational modeling; Data models; Electronics packaging; Fault detection; Fault tolerance; Fault tolerant systems; Runtime;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
High Performance Computing (HiPC), 2010 International Conference on
Conference_Location :
Dona Paula
Print_ISBN :
978-1-4244-8518-5
Electronic_ISBN :
978-1-4244-8519-2
Type :
conf
DOI :
10.1109/HIPC.2010.5713195
Filename :
5713195