Title :
A gracefully degrading massively parallel system using the BSP model, and its evaluation
Author :
Savva, Andreas ; Nanya, Takashi
Author_Institution :
Fujitsu Labs. Ltd., Kawasaki, Japan
fDate :
1/1/1999
Abstract :
The Bulk-Synchronous Parallel (BSP) model was proposed as a unifying model for parallel computation. By using Randomized Shared Memory (RSM), the model offers an asymptotically optimal emulation of the Parallel Random Access Machine (PRAM). Using the BSP model with RSM, we construct a gracefully degrading massively parallel system with a fault tolerance (FT) scheme that relies on memory duplication to ensure global memory integrity and to speed up reconfiguration. After a fault occurs, global reconfiguration restores the logical properties of the system. Work done during reconfiguration is shared equally among the live processors, with minimal coordination. We analyze, at the level of the BSP model, how the performance of a system may change as processors fail and the performance of the interconnection network degrades. We relate the change in overall system performance to the change in computation and communication load on the live processors. Further, we show how to estimate the overhead imposed by the FT scheme. We evaluate the reconfiguration time, the overhead, and the graceful degradation of the system experimentally with an implementation on a Massively Parallel Processor (MPP). We show that the predictions about the degradation of the system and the overhead cost of the scheme are accurate.
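The BSP-level analysis mentioned in the abstract can be illustrated with the standard BSP superstep cost, w + h*g + l, where w is the per-processor computation, h the per-processor communication volume, g the per-word communication cost, and l the barrier cost. The sketch below is not the paper's own formulation: the names `BspMachine`, `superstep_cost`, and `slowdown` are hypothetical, and it assumes that total work and traffic are spread evenly over the live processors.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BspMachine:
    """Hypothetical description of a BSP machine in a given state."""
    live: int   # number of live (non-failed) processors
    g: float    # communication cost per word, in time units
    l: float    # barrier synchronization cost, in time units


def superstep_cost(total_work: float, total_words: float, m: BspMachine) -> float:
    """Standard BSP superstep cost w + h*g + l.

    Assumption: computation and communication are divided evenly
    among the live processors, so w and h grow as processors fail.
    """
    if m.live <= 0:
        raise ValueError("at least one live processor is required")
    w = total_work / m.live    # per-processor computation load
    h = total_words / m.live   # per-processor communication load
    return w + h * m.g + m.l


def slowdown(total_work: float, total_words: float,
             healthy: BspMachine, degraded: BspMachine) -> float:
    """Ratio of degraded to healthy superstep cost for the same workload."""
    return (superstep_cost(total_work, total_words, degraded)
            / superstep_cost(total_work, total_words, healthy))


if __name__ == "__main__":
    healthy = BspMachine(live=64, g=4.0, l=100.0)
    degraded = BspMachine(live=32, g=4.0, l=100.0)  # half the processors lost
    print(slowdown(64000.0, 6400.0, healthy, degraded))
```

Under these assumptions, losing half the processors less than doubles the superstep cost, because the barrier term l does not grow with the per-processor load. A degraded interconnect could be modelled by raising g or l in the degraded machine.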
Keywords :
fault tolerant computing; parallel processing; performance evaluation; BSP model; asymptotically optimal emulation; bulk-synchronous parallel model; fault tolerance; global memory integrity; graceful degradation; gracefully degrading massively parallel system; logical properties; memory duplication; parallel random access machine; randomized shared memory; system performance; Computational modeling; Concurrent computing; Degradation; Emulation; Failure analysis; Fault tolerant systems; Multiprocessor interconnection networks; Performance analysis; Phase change random access memory; System performance
Journal_Title :
Computers, IEEE Transactions on