  • DocumentCode
    2806174
  • Title
    A Redundant Communication Approach to Scalable Fault Tolerance in PGAS Programming Models
  • Author
    Ali, Nawab; Krishnamoorthy, Sriram; Govind, Niranjan; Palmer, Bruce
  • Author_Institution
    Pacific Northwest Nat. Lab., Richland, WA, USA
  • fYear
    2011
  • fDate
    9-11 Feb. 2011
  • Firstpage
    24
  • Lastpage
    31
  • Abstract
    Recent trends in high-performance computing point toward increasingly large machines with millions of processing, storage, and networking elements. Unfortunately, the reliability of these machines is inversely proportional to their size, resulting in a system-wide mean time between failures (MTBF) ranging from a few days to a few hours. As such, for long-running applications, the ability to efficiently recover from frequent failures is essential. Traditional forms of fault tolerance, such as checkpoint/restart, suffer from performance issues related to limited I/O and memory bandwidth. In this paper, we present a fault-tolerance mechanism that reduces the cost of failure recovery by maintaining shadow data structures and performing redundant remote memory accesses. Results from a computational chemistry application running at scale show that our techniques provide applications with a high degree of fault tolerance and low (2%-4%) overhead for 2048 processors.
  • Keywords
    data structures; parallel programming; software fault tolerance; MTBF; PGAS programming model; computational chemistry application; data structure; failure recovery; high performance computing; redundant communication; redundant remote memory access; scalable fault tolerance; system wide mean time between failure; Arrays; Fault tolerance; Fault tolerant systems; Gallium; Program processors; Programming; Computational chemistry; Global Arrays; NWChem
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Title
    Parallel, Distributed and Network-Based Processing (PDP), 2011 19th Euromicro International Conference on
  • Conference_Location
    Ayia Napa
  • ISSN
    1066-6192
  • Print_ISBN
    978-1-4244-9682-2
  • Type
    conf
  • DOI
    10.1109/PDP.2011.72
  • Filename
    5738978