• DocumentCode
    1949788
  • Title
    Balancing reliability, cost, and performance tradeoffs with FreeFault
  • Author
    Kim, Dong Wan; Erez, Mattan
  • Author_Institution
    Department of Electrical and Computer Engineering, University of Texas at Austin, Austin, TX, USA
  • fYear
    2015
  • fDate
    7-11 Feb. 2015
  • Firstpage
    439
  • Lastpage
    450
  • Abstract
    Memory errors have been a major source of system failures, and fault rates may rise even further as memory continues to scale. This increasing fault rate, especially when combined with the advent of integrated on-package memories, may exceed the capabilities of traditional fault tolerance mechanisms or significantly increase their overhead. In this paper, we present FreeFault as a hardware-only, transparent, and nearly free resilience mechanism that is implemented entirely within a processor and can tolerate the majority of DRAM faults. FreeFault repurposes portions of the last-level cache for storing retired memory regions and augments a hardware memory scrubber to monitor memory health and aid retirement decisions. Because it relies on existing structures (cache associativity) for retirement/remapping-type repair, FreeFault has essentially no hardware overhead. Because it requires a very modest portion of the cache (as small as 8KB) to cover a large fraction of DRAM faults, FreeFault has almost no impact on performance. We explain how FreeFault adds an attractive layer in an overall resilience scheme of highly reliable and highly available systems by delaying, and even entirely avoiding, calling upon software to make tradeoff decisions between memory capacity, performance, and reliability.
  • Keywords
    DRAM chips; cache storage; fault tolerant computing; integrated circuit reliability; performance evaluation; DRAM faults; FreeFault; fault rates; fault tolerance mechanisms; hardware memory scrubber; last-level cache; memory capacity; memory errors; memory health; reliability-cost-performance tradeoff balancing; retired memory regions; retirement decisions; system failures; tradeoff decisions; Error correction codes; Hardware; Maintenance engineering; Memory management; Random access memory; Retirement; Software;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA)
  • Conference_Location
    Burlingame, CA
  • Type
    conf
  • DOI
    10.1109/HPCA.2015.7056053
  • Filename
    7056053
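
  • Illustration
    The abstract describes retiring faulty memory regions into a small, reserved portion of the last-level cache. The sketch below is a toy model of that idea only and is not taken from the paper: the 64-byte line size, the number of sets, the one-locked-way-per-set cap, and all names (RetirementCache, retire, read) are assumptions made for illustration. Only the 8KB budget comes from the abstract.

```python
from dataclasses import dataclass, field

LINE_BYTES = 64           # assumed cache-line size (not stated in the abstract)
NUM_SETS = 2048           # assumed number of last-level cache sets
MAX_LOCKED_WAYS = 1       # assumed cap on ways per set given up for repair
BUDGET_BYTES = 8 * 1024   # the 8KB figure quoted in the abstract


@dataclass
class RetirementCache:
    """Toy model: faulty lines are pinned in the cache and served from there."""
    locked: dict = field(default_factory=dict)    # line address -> pinned data
    per_set: dict = field(default_factory=dict)   # set index -> locked-way count

    @staticmethod
    def line_of(addr: int) -> int:
        return addr - addr % LINE_BYTES

    @staticmethod
    def set_index(addr: int) -> int:
        return (addr // LINE_BYTES) % NUM_SETS

    def retire(self, addr: int, data: bytes = b"") -> bool:
        """Pin the line holding addr. False means the budget or the set is
        full, so some other mechanism would have to handle the fault."""
        line = self.line_of(addr)
        if line in self.locked:
            return True
        if len(self.locked) * LINE_BYTES >= BUDGET_BYTES:
            return False
        s = self.set_index(line)
        if self.per_set.get(s, 0) >= MAX_LOCKED_WAYS:
            return False
        self.locked[line] = data
        self.per_set[s] = self.per_set.get(s, 0) + 1
        return True

    def read(self, addr: int, dram_read):
        """Serve retired lines from the cache; everything else goes to DRAM."""
        line = self.line_of(addr)
        if line in self.locked:
            return self.locked[line]
        return dram_read(line)
```

    Under these assumptions, an 8KB budget holds 8192 / 64 = 128 retired lines. A retire call that returns False stands in for the point at which a system would fall back to other mechanisms, such as the software-level tradeoff decisions the abstract mentions.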