• DocumentCode
    1341698
  • Title
    Full Fault Resilience and Relaxed Synchronization Requirements at the Cache-Memory Interface

  • Author
    Yang, Chengmo ; Orailoglu, Alex

  • Author_Institution
    Dept. of Electr. & Comput. Eng., Univ. of Delaware, Newark, DE, USA
  • Volume
    19
  • Issue
    11
  • fYear
    2011
  • Firstpage
    1996
  • Lastpage
    2009
  • Abstract
    While multicore platforms promise significant speedup for many current applications, they also suffer from increased reliability problems as a result of ever scaling device size. The projected elevation in fault rate, together with the diverse behavior of fault manifestation, argues for highly efficient solutions of full fault resilience. Traditional duplication and checkpointing strategies typically impose sizable overhead in checkpointing execution results, or in constantly synchronizing two threads for value checking. To reduce such overhead while at the same time delivering full fault resilience, we propose an integrated fault detection and checkpointing framework, wherein the comparison and checkpointing process is performed at the cache-memory interface. By sharing a single cache between two duplicated threads, execution results can be directly verified in the cache before being written back, thus strictly protecting the memory against execution faults. Meanwhile, as unconfirmed data are allowed to be written into the cache, one thread can run well ahead of the other, thus relaxing the straightjacket of the strict execution synchronization model. If a cache block is constantly updated, further synchronization relaxation can be achieved through extending the cache design to duplicate a cache block and skip the comparison of the intermediate values.
  • Keywords
    cache storage; integrated circuit reliability; synchronisation; cache block; cache-memory interface; checkpointing strategy; duplication strategy; fault manifestation; full fault resilience; increased reliability problem; integrated fault detection; relaxed synchronization requirement; strict execution synchronization model; Checkpointing; Fault detection; Fault tolerance; Fault tolerant systems; Instruction sets; Registers; Synchronization; multicore reliability; recovery; redundant execution
  • fLanguage
    English
  • Journal_Title
    Very Large Scale Integration (VLSI) Systems, IEEE Transactions on
  • Publisher
    IEEE
  • ISSN
    1063-8210
  • Type
    jour
  • DOI
    10.1109/TVLSI.2010.2067230
  • Filename
    5593910