DocumentCode :
1159892
Title :
The performance of cache-based error recovery in multiprocessors
Author :
Janssens, Bob ; Fuchs, W. Kent
Author_Institution :
Center for Reliable & High Performance Comput., Illinois Univ., Urbana, IL, USA
Volume :
5
Issue :
10
fYear :
1994
fDate :
10/1/1994 12:00:00 AM
Firstpage :
1033
Lastpage :
1043
Abstract :
Several variations of cache-based checkpointing for rollback error recovery from transient errors in shared-memory multiprocessors have been recently developed. By modifying the cache replacement policy, these techniques use the inherent redundancy in the memory hierarchy to periodically checkpoint the computation state. Three schemes, different in the manner in which they avoid rollback propagation, are evaluated in this paper. By simulation with address traces from parallel applications running on an Encore Multimax shared-memory multiprocessor, we evaluate the performance effect of integrating the recovery schemes in the cache coherence protocol. Our results indicate that the cache-based schemes can provide checkpointing capability with low performance overhead, but with uncontrollable high variability in the checkpoint interval.
Keywords :
buffer storage; performance evaluation; redundancy; shared memory systems; system recovery; virtual machines; Encore Multimax; address traces; cache coherence protocol; cache replacement policy; cache-based checkpointing; cache-based error recovery performance; cache-based schemes; checkpoint interval; computation state; inherent redundancy; low performance overhead; memory hierarchy; multiprocessors; parallel applications; performance evaluation; recovery schemes; rollback error recovery; rollback propagation; shared-memory multiprocessors; transient errors; Checkpointing; Computational modeling; Error analysis; Fault detection; Fault tolerant systems; Hardware; NASA; Protocols; Redundancy; Registers;
fLanguage :
English
Journal_Title :
Parallel and Distributed Systems, IEEE Transactions on
Publisher :
IEEE
ISSN :
1045-9219
Type :
jour
DOI :
10.1109/71.313120
Filename :
313120