Title :
Active/active replication for highly available HPC system services
Author :
Engelmann, C. ; Scott, S.L. ; Leangsuksun, C. ; He, X.
Author_Institution :
Comput. Sci. & Math. Div., Oak Ridge Nat. Lab., TN, USA
Abstract :
Today's high performance computing systems have several reliability deficiencies resulting in availability and serviceability issues. Head and service nodes represent a single point of failure and control for an entire system: when one fails, the system is inaccessible and unmanageable until repair, causing significant downtime. This paper introduces two distinct replication methods (internal and external) for providing symmetric active/active high availability for multiple head and service nodes running in virtual synchrony. It compares both methods in terms of expected correctness, ease of use, and performance, based on early results from ongoing work in providing symmetric active/active high availability for two HPC system services (TORQUE and the PVFS metadata server). It continues with a short description of a distributed mutual exclusion algorithm and a brief statement regarding the handling of Byzantine failures. The paper concludes with an overview of past and ongoing work and a short summary of the presented research.
Keywords :
parallel processing; system recovery; Byzantine failures; PVFS metadata server; TORQUE; active replication; distributed mutual exclusion algorithm; high performance computing system; highly available HPC system service; Availability; Computational modeling; Computer architecture; Concurrent computing; Contracts; Control systems; High performance computing; Laboratories; Quantum computing; Resource management
Conference_Title :
Availability, Reliability and Security, 2006. ARES 2006. The First International Conference on
Print_ISBN :
0-7695-2567-9
DOI :
10.1109/ARES.2006.23