  • DocumentCode
    244349
  • Title
    A-ABFT: Autonomous Algorithm-Based Fault Tolerance for Matrix Multiplications on Graphics Processing Units
  • Author
    Braun, Claus ; Halder, Sebastian ; Wunderlich, Hans-Joachim
  • Author_Institution
    Inst. of Comput. Archit. & Comput. Eng., Univ. of Stuttgart, Stuttgart, Germany
  • fYear
    2014
  • fDate
    23-26 June 2014
  • Firstpage
    443
  • Lastpage
    454
  • Abstract
    Graphics processing units (GPUs) enable large-scale scientific applications and simulations on the desktop. To allow scientific computing on GPUs with high performance and reliability requirements, the application of software-based fault tolerance is attractive. Algorithm-Based Fault Tolerance (ABFT) protects important scientific operations like matrix multiplications. However, the application to floating-point operations necessitates the runtime classification of errors into inevitable rounding errors, allowed compute errors in the magnitude of such rounding errors, and into critical errors that are larger than those and not tolerable. Hence, an ABFT scheme needs suitable rounding error bounds to detect errors reliably. The determination of such error bounds is a highly challenging task, especially since it has to be integrated tightly into the algorithm and executed autonomously with low performance overhead. In this work, A-ABFT for matrix multiplications on GPUs is introduced, which is a new, parallel ABFT scheme that determines rounding error bounds autonomously at runtime with low performance overhead and high error coverage.
  • Keywords
    fault tolerance; floating point arithmetic; graphics processing units; matrix multiplication; A-ABFT; GPU; autonomous algorithm-based fault tolerance; floating-point operations; matrix multiplications; reliability requirements; software-based fault tolerance; Fault tolerance; Fault tolerant systems; Graphics processing units; Probabilistic logic; Reactive power; Upper bound; Vectors; ABFT; Algorithm-Based Fault Tolerance; Matrix Multiplication; Rounding Error Estimation
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Dependable Systems and Networks (DSN), 2014 44th Annual IEEE/IFIP International Conference on
  • Conference_Location
    Atlanta, GA
  • Type
    conf
  • DOI
    10.1109/DSN.2014.48
  • Filename
    6903601