• DocumentCode
    568637
  • Title

    Transient Fault Tolerance for ccNUMA Architecture

  • Author

    Xingjun Zhang ; Endong Wang ; Feilong Tang ; Meishun Yang ; Hengyi Wei ; Xiaoshe Dong

  • Author_Institution
    Dept. of Comput. Sci. & Technol., Xi'an Jiaotong Univ., Xi'an, China
  • fYear
    2012
  • fDate
    4-6 July 2012
  • Firstpage
    197
  • Lastpage
    202
  • Abstract
    Transient faults are a critical concern for the reliability of microprocessor systems. Software fault tolerance is more flexible and less costly than hardware fault tolerance. Moreover, as architectural trends point toward multicore designs, there is substantial interest in adapting parallel and redundant hardware resources for transient fault tolerance. This paper proposes a process-level fault tolerance technique, a software-centric approach that efficiently schedules and synchronizes redundant processes using the redundant processors of a ccNUMA system. This improves the running efficiency of the redundant processes and reduces time and space overhead. The paper focuses on methods for detecting and handling errors in the redundant processes. A real prototype, designed to be transparent to the application, has been implemented. Test results show that the system can promptly detect CPU and memory soft errors that cause exceptions in the redundant processes, while ensuring that application services remain uninterrupted and incur only a short delay.
  • Keywords
    delays; error detection; error handling; memory architecture; multiprocessing systems; processor scheduling; redundancy; software fault tolerance; synchronisation; CPU; ccNUMA architecture; ccNUMA processor redundancy; delay; error detection method; error handling method; microprocessor system; multicore design; parallel resource; process level fault tolerance; processor scheduling; prototype; reliability; soft error detection; software centric approach; synchronization; transient fault tolerance; Fault tolerant systems; Hardware; Kernel; Redundancy; Synchronization; Transient analysis; Transient fault; ccNUMA; dual-process;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Innovative Mobile and Internet Services in Ubiquitous Computing (IMIS), 2012 Sixth International Conference on
  • Conference_Location
    Palermo
  • Print_ISBN
    978-1-4673-1328-5
  • Type

    conf

  • DOI
    10.1109/IMIS.2012.188
  • Filename
    6296854