DocumentCode :
2089040
Title :
Configurable Reliability in Multicore Operating Systems
Author :
Liao, Jianwei ; Shimosawa, Taku ; Ishikawa, Yutaka
Author_Institution :
Grad. Sch. of Inf. Sci. & Technol., Univ. of Tokyo, Tokyo, Japan
fYear :
2011
fDate :
24-26 Aug. 2011
Firstpage :
256
Lastpage :
262
Abstract :
This paper presents a new multicore operating system with a fault tolerance mechanism called Shimos2, which runs an instance of the kernel on each CPU core. It allows administrators to designate applications as requiring "high availability and reliability," without any modifications to an application's source code. Shimos2 contains a checkpoint/restart module, and it saves the running status of the designated applications periodically to the kernel's private memory area. Furthermore, a timer daemon and a monitor daemon are employed to detect kernels that are not working as a result of the host kernel being dead or hanging due to transient hardware faults. Once the host kernel is not working anymore, the stopped processes can be restarted on an idle kernel by automatically reloading the checkpointed image from where the last checkpoint was set. We have conducted experiments to evaluate Shimos2 from various aspects, including runtime overhead. The experimental results show that compared with Shimos2 without its fault tolerance mechanism, Shimos2 imposes less than 1.1% extra overhead on the designated applications themselves, and less than 1.2% extra overhead on other applications which have not been designated as "highly available." Compared to restarting the stopped processes on another virtual machine in KVM by using BLCR, Shimos2 can save around 18% of service downtime.
Keywords :
microcomputers; multiprocessing systems; operating system kernels; CPU core; Shimos2; application source code; checkpoint/restart module; checkpointed image; configurable reliability; fault tolerance mechanism; host kernel; kernel private memory area; monitor daemon; multicore operating system; timer daemon; transient hardware fault; Availability; Fault tolerance; Fault tolerant systems; Kernel; Monitoring; Multicore processing;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Computational Science and Engineering (CSE), 2011 IEEE 14th International Conference on
Conference_Location :
Dalian, Liaoning
Print_ISBN :
978-1-4577-0974-6
Type :
conf
DOI :
10.1109/CSE.2011.54
Filename :
6062883