Title :
CIFTS: A Coordinated Infrastructure for Fault-Tolerant Systems
Author :
Gupta, R. ; Beckman, P. ; Park, B.-H. ; Lusk, E. ; Hargrove, P. ; Geist, A. ; Panda, D.K. ; Lumsdaine, A. ; Dongarra, J.
Author_Institution :
Argonne Nat. Lab., Argonne, IL, USA
Abstract :
Considerable work has been done on providing fault tolerance capabilities for different software components on large-scale high-end computing systems. Thus far, however, these fault-tolerant components have worked insularly and independently, and information about faults is rarely shared. Such lack of system-wide fault tolerance is emerging as one of the biggest problems on leadership-class systems. In this paper, we propose a coordinated infrastructure, named CIFTS, that enables system software components to share fault information with each other and adapt to faults in a holistic manner. Central to the CIFTS infrastructure is a Fault Tolerance Backplane (FTB) that enables fault notification and awareness throughout the software stack, including fault-aware libraries, middleware, and applications. We present details of the CIFTS infrastructure and the interface specification that has allowed various software programs, including MPICH2, MVAPICH, Open MPI, and PVFS, to plug into the CIFTS infrastructure. Further, through a detailed evaluation we demonstrate the nonintrusive low-overhead capability of CIFTS that lets applications run with minimal performance degradation.
Keywords :
formal specification; message passing; middleware; object-oriented programming; software fault tolerance; software libraries; MPICH2; MVAPICH; Open MPI; coordinated infrastructure; fault awareness; fault information sharing; fault notification; fault tolerance backplane; fault-aware libraries; fault-tolerant system; interface specification; large-scale high-end computing system; leadership-class system; middleware; performance degradation; software component; software program; software stack; Application software; Backplanes; Degradation; Fault tolerance; Fault tolerant systems; Large-scale systems; Middleware; Plugs; Software libraries; System software
Conference_Title :
2009 International Conference on Parallel Processing (ICPP '09)
Conference_Location :
Vienna
Print_ISBN :
978-1-4244-4961-3
ISSN :
0190-3918
DOI :
10.1109/ICPP.2009.20