• DocumentCode
    244409
  • Title

    Evaluating the Error Resilience of Parallel Programs

  • Author

    Fang, Bo ; Pattabiraman, Karthik ; Ripeanu, Matei ; Gurumurthi, Sudhanva

  • fYear
    2014
  • fDate
    23-26 June 2014
  • Firstpage
    720
  • Lastpage
    725
  • Abstract
    As a consequence of increasing hardware fault rates, HPC systems face significant challenges in terms of reliability. Evaluating the error resilience of HPC applications is an essential step for building efficient fault-tolerant mechanisms for these applications. In this paper, we propose a methodology to characterize the resilience of OpenMP programs using fault-injection experiments. We find that the error resilience of OpenMP applications depends on the program structure and thread model; hence, these need to be taken into account while characterizing error resilience. We also report preliminary results about the correlation between the application's error resilience and the algorithm(s) used in the application.
  • Keywords
    fault tolerance; message passing; parallel programming; software reliability; HPC systems; OpenMP program; error resilience; fault-injection experiment; fault-tolerant mechanism; hardware fault rates; parallel programs; program structure; reliability; thread model; Benchmark testing; Hardware; Instruction sets; Instruments; Message systems; Resilience; Standards; Error Resilience; OpenMP; algorithms;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Dependable Systems and Networks (DSN), 2014 44th Annual IEEE/IFIP International Conference on
  • Conference_Location
    Atlanta, GA
  • Type

    conf

  • DOI
    10.1109/DSN.2014.73
  • Filename
    6903631