• DocumentCode
    2933354
  • Title
    ROSE::FTTransform - A source-to-source translation framework for exascale fault-tolerance research
  • Author
    Lidman, Jacob ; Quinlan, Daniel J. ; Liao, Chunhua ; McKee, Sally A.
  • Author_Institution
    Lawrence Livermore Nat. Lab., Livermore, CA, USA
  • fYear
    2012
  • fDate
    25-28 June 2012
  • Firstpage
    1
  • Lastpage
    6
  • Abstract
    Exascale computing systems will require sufficient resilience to tolerate numerous types of hardware faults while still assuring correct program execution. Such extreme-scale machines are expected to be dominated by processors driven at lower voltages (near the minimum 0.5 volts for current transistors). At these voltage levels, the rate of transient errors increases dramatically due to the sensitivity to transient and geographically localized voltage drops on parts of the processor chip. To achieve power efficiency, these processors are likely to be streamlined and minimal, and thus they cannot be expected to handle transient errors entirely in hardware. Here we present an open, compiler-based framework to automate the armoring of High Performance Computing (HPC) software to protect it from these types of transient processor errors. We develop an open infrastructure to support research work in this area, and we define tools that, in the future, may provide more complete automated and/or semi-automated solutions to support software resiliency on future exascale architectures. Results demonstrate that our approach is feasible, pragmatic in how it can be separated from the software development process, and reasonably efficient (0% to 30% overhead for the Jacobi iteration on common hardware; and 20%, 40%, 26%, and 2% overhead for a randomly selected subset of benchmarks from the Livermore Loops [1]).
  • Keywords
    fast Fourier transforms; fault tolerant computing; mainframes; microprocessor chips; power aware computing; program compilers; program interpreters; sensitivity; transients; HPC software; ROSE::FTTransform; automated solutions; exascale architectures; exascale computing systems; exascale fault tolerance research; extreme scale machines; hardware fault tolerance; high performance computing; open compiler-based framework; power efficiency; processor chip; program execution; semiautomated solutions; software development process; software resiliency; source-to-source translation framework; streamlined processor; transient error handling; transient sensitivity; voltage drops; Arrays; Fault tolerant systems; Kernel; Optimization; Program processors; Redundancy; Exascale; Fault Tolerance; High Performance Computing; Redundancy; Source-to-Source Compiler;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Dependable Systems and Networks Workshops (DSN-W), 2012 IEEE/IFIP 42nd International Conference on
  • Conference_Location
    Boston, MA
  • Print_ISBN
    978-1-4673-2264-5
  • Electronic_ISBN
    978-1-4673-2265-2
  • Type
    conf
  • DOI
    10.1109/DSNW.2012.6264672
  • Filename
    6264672