• DocumentCode
    3428804
  • Title

    Detection and correction of silent data corruption for large-scale high-performance computing

  • Author

    Fiala, D. ; Mueller, Frank ; Engelmann, Christian ; Riesen, R. ; Ferreira, K. ; Brightwell, Ron

  • Author_Institution
    North Carolina State Univ., Raleigh, NC, USA
  • fYear
    2012
  • fDate
    10-16 Nov. 2012
  • Firstpage
    1
  • Lastpage
    12
  • Abstract
    Faults have become the norm rather than the exception for high-end computing clusters. Exacerbating this situation, some of these faults remain undetected, manifesting themselves as silent errors that allow applications to compute incorrect results. This paper studies the potential for redundancy to detect and correct soft errors in MPI message-passing applications while investigating the challenges inherent to detecting soft errors within MPI applications by providing transparent MPI redundancy. By assuming a model wherein corruption in application data manifests itself by producing differing MPI messages between replicas, we study the best-suited protocols for detecting and correcting corrupted MPI messages. Using our fault injector, we observe that even a single error can have profound effects on applications by causing a cascading pattern of corruption, which in most cases spreads to all other processes. Results indicate that our consistency protocols can successfully protect applications experiencing even high rates of silent data corruption.
  • Keywords
    application program interfaces; fault tolerant computing; message passing; MPI application; MPI redundancy; consistency protocol; fault injector; high performance computing; high-end computing cluster; message passing interface; silent data corruption correction; silent data corruption detection; soft error correction; soft error detection; Checkpointing; Computational modeling; Error correction codes; Hardware; Protocols; Receivers; Redundancy
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    High Performance Computing, Networking, Storage and Analysis (SC), 2012 International Conference for
  • Conference_Location
    Salt Lake City, UT
  • ISSN
    2167-4329
  • Print_ISBN
    978-1-4673-0805-2
  • Type

    conf

  • DOI
    10.1109/SC.2012.49
  • Filename
    6468485