Title :
Improving Fault Tolerance through Crash Recovery
Author :
Yeh, Tsozen ; Cheng, Weian
Author_Institution :
Dept. of Comput. Sci. & Inf. Eng., Fu Jen Catholic Univ., Taipei, Taiwan
Abstract :
Computers are indispensable to modern society. Despite the tremendous effort spent on improving software quality, software bugs remain inevitable. Their consequences range from minor (such as font problems) to catastrophic (such as failures in mission-critical control programs). Researchers have proposed various methods for dealing with software failure. Restarting all or part of the system can fix certain environment-related bugs, such as race conditions or memory overflow, at an extra cost in time and system resources. However, if a software failure originates from invalid user input or faulty program logic, simply restarting the program is unlikely to avoid the crash. Under such circumstances, the crash may be averted by letting the user re-enter legitimate values or choose a different part of the program to execute. Checkpointing can help achieve this goal. We propose and develop a system that takes multiple checkpoints as a running program accepts input from the user. When a program is about to crash, our system presents the checkpoints made for that program to the user, who can then pick a specific checkpoint, re-enter legitimate values or select different options, and continue execution of the program. We implemented our design in the Linux operating system and conducted experiments on diverse programs to evaluate its performance. Our results show that, with our design, users can re-execute programs with no observable performance impact, both when taking checkpoints and when restoring program execution from different checkpoints.
Keywords :
Linux; checkpointing; program debugging; software fault tolerance; software performance evaluation; software quality; Linux operating system; checkpoint making; computer software quality; crash recovery; design; fault tolerance; performance evaluation; program execution restoration; software bug; software failure; Calculators; Computer bugs; Kernel; Linux; fault tolerance; operating system; system crash;
Conference_Title :
Biometrics and Security Technologies (ISBAST), 2012 International Symposium on
Conference_Location :
Taipei
Print_ISBN :
978-1-4673-0917-2
DOI :
10.1109/ISBAST.2012.15