• DocumentCode
    2089040
  • Title

    Configurable Reliability in Multicore Operating Systems

  • Author

    Liao, Jianwei ; Shimosawa, Taku ; Ishikawa, Yutaka

  • Author_Institution
    Grad. Sch. of Inf. Sci. & Technol., Univ. of Tokyo, Tokyo, Japan
  • fYear
    2011
  • fDate
    24-26 Aug. 2011
  • Firstpage
    256
  • Lastpage
    262
  • Abstract
    This paper presents a new multicore operating system with a fault tolerance mechanism called Shimos2, which runs an instance of the kernel on each CPU core. It allows administrators to designate applications as requiring "high availability and reliability," without any modifications to an application's source code. Shimos2 contains a checkpoint/restart module, and it saves the running status of the designated applications periodically to the kernel's private memory area. Furthermore, a timer daemon and a monitor daemon are employed to detect kernels that are not working as a result of the host kernel being dead or hanging due to transient hardware faults. Once the host kernel is not working anymore, the stopped processes can be restarted on an idle kernel by automatically reloading the checkpointed image from where the last checkpoint was set. We have conducted experiments to evaluate Shimos2 from various aspects, including runtime overhead. The experimental results show that compared with Shimos2 without its fault tolerance mechanism, Shimos2 imposes less than 1.1% extra overhead on the designated applications themselves, and less than 1.2% extra overhead on other applications which have not been designated as "highly available." Compared to restarting the stopped processes on another virtual machine in KVM by using BLCR, Shimos2 can save around 18% of service downtime.
  • Keywords
    microcomputers; multiprocessing systems; operating system kernels; CPU core; Shimos2; application source code; checkpoint/restart module; checkpointed image; configurable reliability; fault tolerance mechanism; host kernel; kernel private memory area; monitor daemon; multicore operating system; timer daemon; transient hardware fault; Availability; Fault tolerance; Fault tolerant systems; Kernel; Monitoring; Multicore processing;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Computational Science and Engineering (CSE), 2011 IEEE 14th International Conference on
  • Conference_Location
    Dalian, Liaoning
  • Print_ISBN
    978-1-4577-0974-6
  • Type

    conf

  • DOI
    10.1109/CSE.2011.54
  • Filename
    6062883