• DocumentCode
    228740
  • Title
    A System Software Approach to Proactive Memory-Error Avoidance
  • Author
    Costa, Carlos H. A.; Park, Yoonho; Rosenburg, Bryan S.; Cher, Chen-Yong; Ryu, Kyung Dong
  • Author_Institution
    IBM T. J. Watson Res. Center, Yorktown Heights, NY, USA
  • fYear
    2014
  • fDate
    16-21 Nov. 2014
  • Firstpage
    707
  • Lastpage
    718
  • Abstract
    Today's HPC systems use two mechanisms to address main-memory errors. Error-correcting codes make correctable errors transparent to software, while checkpoint/restart (CR) enables recovery from uncorrectable errors. Unfortunately, CR overhead will be enormous at exascale due to the high failure rate of memory. We propose a new OS-based approach that proactively avoids memory errors using prediction. This scheme exposes correctable error information to the OS, which migrates pages and offlines unhealthy memory to avoid application crashes. We analyze memory error patterns in extensive logs from a BG/P system and show how correctable error patterns can be used to identify memory likely to fail. We implement a proactive memory management system on BG/Q by extending the firmware and Linux. We evaluate our approach with a realistic workload and compare our overhead against CR. We show improved resilience with negligible performance overhead for applications.
  • Keywords
    Linux; checkpointing; error correction codes; firmware; parallel processing; storage management; BG-P system; CR; HPC systems; OS-based approach; checkpoint-restart; correctable error patterns; error-correcting codes; memory error pattern analysis; page migration; proactive memory management system; proactive memory-error avoidance; system software approach; Algorithm design and analysis; Correlation; Error analysis; Memory management; Monitoring; Prediction algorithms; Memory Structures; Operating Systems; Reliability; Fault-Tolerance
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    SC14: International Conference for High Performance Computing, Networking, Storage and Analysis
  • Conference_Location
    New Orleans, LA
  • Print_ISBN
    978-1-4799-5499-5
  • Type
    conf
  • DOI
    10.1109/SC.2014.63
  • Filename
    7013045