Regional Information Center for Science and Technology - Fault-tolerant communication runtime support for data-centric programming models

DocumentCode :

2515954

Title :

Fault-tolerant communication runtime support for data-centric programming models

Author :

Vishnu, Abhinav ; Van Dam, Huub ; De Jong, Wibe ; Balaji, Pavan ; Song, Seunghyun

Author_Institution :

Pacific Northwest Nat. Lab., Richland, WA, USA

fYear :

2010

fDate :

19-22 Dec. 2010

Firstpage :

Lastpage :

Abstract :

The largest supercomputers in the world today consist of hundreds of thousands of processing cores and many more other hardware components. At such scales, hardware faults are commonplace, necessitating fault-resilient software systems. While different fault-resilient models are available, most focus on allowing the computational processes to survive faults. On the other hand, we have recently started investigating fault resilience techniques for data-centric programming models such as the partitioned global address space (PGAS) models. The primary difference in data-centric models is the decoupling of computation and data locality. That is, data placement is decoupled from the executing processes, allowing us to view process failure (a physical node hosting a process is dead) separately from data failure (a physical node hosting data is dead). In this paper, we take a first step toward data-centric fault resilience by designing and implementing a fault-resilient, one-sided communication runtime framework using Global Arrays and its communication system, ARMCI. The framework consists of a fault-resilient process manager; a low-overhead, network-assisted remote-node fault detection module; non-data-moving collective communication primitives; and failure semantics and error codes for one-sided communication runtime systems. Our performance evaluation indicates that the framework incurs little overhead compared to state-of-the-art designs and provides a fundamental framework of fault resiliency for PGAS models.

Keywords :

fault tolerant computing; mainframes; parallel machines; parallel programming; system recovery; ARMCI; PGAS model; core processor; data locality; data-centric fault resilience model; data-centric programming model; error code; failure semantic; fault tolerant communication runtime support; fault-resilient software system; global array; hardware fault; network-assisted remote-node fault detection module; nondata-moving collective communication; partitioned global address space model; supercomputer; Computational modeling; Data models; Electronics packaging; Fault detection; Fault tolerance; Fault tolerant systems; Runtime;

fLanguage :

English

Publisher :

IEEE

Conference_Title :

High Performance Computing (HiPC), 2010 International Conference on

Conference_Location :

Dona Paula

Print_ISBN :

978-1-4244-8518-5

Electronic_ISBN :

978-1-4244-8519-2

Type :

conf

DOI :

10.1109/HIPC.2010.5713195

Filename :

5713195

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=2515954