Title :
A Delayed Checkpoint Approach for Communication-Induced Checkpointing in Autonomic Computing
Author :
Calixto Simon, Alberto Calixto ; Hernandez, Saul E. Pomares ; Perez Cruz, Jose Roberto
Author_Institution :
Inst. Nac. de Astrofis., Opt. y Electron., Tonantzintla, Mexico
Abstract :
Although the Autonomic Computing initiative was introduced a dozen years ago, several challenges remain open. One of them is efficient runtime monitoring aimed at detecting, diagnosing, and repairing problems caused by failures or bugs in software or hardware components. For this purpose, Communication-induced Checkpointing (CIC) can be a useful tool. CIC has been used to address a wide range of problems that arise in distributed systems, such as rollback recovery, software debugging, and software verification. In CIC algorithms, autonomic components (processes) cooperate asynchronously by piggybacking on application messages information about their saved local states, called checkpoints. CIC aims to form global consistent snapshots by grouping checkpoints (one per component) in a non-coordinated way. To achieve this, CIC solutions continuously monitor the exchanged control information to identify potentially dangerous checkpointing patterns. When a dangerous pattern is identified, it is broken by locally triggering a forced checkpoint. Nevertheless, as we will show, not all forced checkpoints triggered by current solutions are necessary. In this paper, we present a delayed checkpoint approach suitable for autonomic computing that reduces the number of forced checkpoints by establishing triggering rules that we call safe checkpoint conditions. Finally, we present results showing that our proposal is more efficient than other current solutions.
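The forced-checkpoint mechanism described in the abstract can be illustrated with a minimal sketch. It is not the delayed approach proposed in the paper, whose safe checkpoint conditions are not given here; it assumes instead the classic index-based scheme in which each process piggybacks its checkpoint index on every message and a receiver that is behind takes a forced checkpoint before delivery. The class name `Process` and its methods are hypothetical, chosen only for this illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Process:
    """One component running a simple index-based CIC scheme (illustrative only)."""
    name: str
    clock: int = 0  # index of the latest local checkpoint
    checkpoints: list = field(default_factory=list)  # (index, kind) pairs

    def __post_init__(self):
        # Every process starts from an initial checkpoint with index 0.
        self.checkpoints.append((self.clock, "initial"))

    def basic_checkpoint(self):
        """Checkpoint taken on the component's own initiative."""
        self.clock += 1
        self.checkpoints.append((self.clock, "basic"))

    def send(self):
        """Return the control information piggybacked on an application message."""
        return self.clock

    def receive(self, piggybacked_clock):
        """Deliver a message; checkpoint first if the sender's index is ahead."""
        forced = piggybacked_clock > self.clock
        if forced:
            # Catch up with the sender so the message is not received
            # "before" a checkpoint that the sender had already taken.
            self.clock = piggybacked_clock
            self.checkpoints.append((self.clock, "forced"))
        return forced


p, q = Process("p"), Process("q")
p.basic_checkpoint()          # p moves to index 1
forced = q.receive(p.send())  # q is behind, so it takes a forced checkpoint
again = q.receive(p.send())   # q has caught up, so no checkpoint is forced
```

In this simple scheme every message from a process with a higher index forces a checkpoint at the receiver. The abstract's claim is that some checkpoints forced by rules of this kind are unnecessary, which is what the safe checkpoint conditions are meant to filter out.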
Keywords :
checkpointing; distributed processing; fault tolerant computing; program debugging; program verification; CIC solutions; autonomic computing; communication-induced checkpointing; control information exchanging; dangerous checkpointing patterns; delayed checkpoint approach; distributed systems; global consistent snapshots; hardware components; rollback recovery; safe checkpoint conditions; software debugging; software diagnosis; software failures; software verification; Arrays; Checkpointing; Clocks; Delays; Monitoring; Software; Software algorithms; Autonomic Computing; Communication-induced checkpointing; Distributed Systems;
Conference_Title :
Enabling Technologies: Infrastructure for Collaborative Enterprises (WETICE), 2013 IEEE 22nd International Workshop on
Conference_Location :
Hammamet
Print_ISBN :
978-1-4799-0405-1
DOI :
10.1109/WETICE.2013.15