DocumentCode :
3333724
Title :
Software schemes of reconfiguration and recovery in distributed memory multicomputers using the actor model
Author :
Peercy, M. ; Banerjee, P.
Author_Institution :
Center for Reliable & High Performance Comput., Illinois Univ., Urbana, IL, USA
fYear :
1995
fDate :
27-30 June 1995
Firstpage :
479
Lastpage :
488
Abstract :
Ideally, a multicomputer system should cope with a processor failure by reconstructing itself, and the application running on it, in order to maintain the available computational power of the remaining processors. We discuss the continuance of running applications through permanent processor failures. We take advantage of the characteristics of the actor model of parallel computation and dynamically checkpoint the activity of the application. Consequently, the runtime system is able to continue an application through multiple nonconcurrent processor failures. We have implemented our techniques through modifications of the runtime system of the parallel language Charm on an Intel iPSC/2 hypercube. After discussing the theory and implementation, we give measurements of overhead due to fault tolerance for a number of applications and demonstrate continuance of the applications after injection of one or more faults.
Keywords :
distributed memory systems; fault tolerant computing; hypercube networks; parallel languages; parallel processing; reconfigurable architectures; reliability; system recovery; Charm parallel language; Intel iPSC/2 hypercube; actor model; applications running; computational power; distributed memory multicomputers; dynamic activity checkpointing; fault injection; fault tolerance; multicomputer system; multiple nonconcurrent processor failure; overhead; parallel computation; permanent processor failures; processor failure; reconfiguration; recovery; runtime system; software schemes; Checkpointing; Computational modeling; Concurrent computing; Distributed computing; Object oriented modeling; Parallel languages; Peer to peer computing; Power system modeling; Power system reliability; Software maintenance;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Fault-Tolerant Computing, 1995. FTCS-25. Digest of Papers., Twenty-Fifth International Symposium on
Conference_Location :
Pasadena, CA, USA
Print_ISBN :
0-8186-7079-7
Type :
conf
DOI :
10.1109/FTCS.1995.466950
Filename :
466950