DocumentCode :
1926180
Title :
An efficient hardware-software approach to network fault tolerance with InfiniBand
Author :
Vishnu, Abhinav ; Krishnan, Manojkumar ; Panda, Dhabaleswar K.
Author_Institution :
Pacific NorthWest Nat. Lab., Richland, WA, USA
fYear :
2009
fDate :
Aug. 31 2009-Sept. 4 2009
Firstpage :
1
Lastpage :
9
Abstract :
Over the last decade or so, clusters have seen a tremendous rise in popularity due to their excellent price-to-performance ratio. A variety of interconnects have been proposed during this period, with InfiniBand leading the way due to its high performance and open standard. The increasing size of InfiniBand clusters has tremendously reduced the mean time between failures of their various components. In this paper, we focus specifically on network component failure and propose a hybrid hardware-software approach to handling network faults. The hybrid approach leverages user-transparent network fault detection and recovery using Automatic Path Migration (APM), and falls back to a software approach when APM fails. Using Global Arrays as the programming model, we implement this approach with the Aggregate Remote Memory Copy Interface (ARMCI), the runtime system of Global Arrays. We evaluate our approach using various benchmarks (siosi7, pentane, h2o7 and siosi3) with NWChem, a very popular ab initio quantum chemistry application. Using the proposed approach, the applications run to completion without restart under emulated network faults, with acceptable overhead for benchmarks that execute for a longer period of time.
Keywords :
shared memory systems; software fault tolerance; InfiniBand; ab initio quantum chemistry application; aggregate remote memory copy interface; automatic path migration; global arrays; hardware-software approach; network component failure; network fault tolerance; open standard; programming model; runtime system; user-transparent network fault detection; Aggregates; Chemistry; Computer science; Fault detection; Fault tolerance; Hardware; High performance computing; Mathematics; Network topology; Scientific computing;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Cluster Computing and Workshops, 2009. CLUSTER '09. IEEE International Conference on
Conference_Location :
New Orleans, LA
ISSN :
1552-5244
Print_ISBN :
978-1-4244-5011-4
Electronic_ISBN :
1552-5244
Type :
conf
DOI :
10.1109/CLUSTR.2009.5289168
Filename :
5289168