  • DocumentCode
    3306859
  • Title
    BulletProof: a defect-tolerant CMP switch architecture
  • Author
    Constantinides, Kypros ; Plaza, Stephen ; Blome, Jason ; Zhang, Bin ; Bertacco, Valeria ; Mahlke, Scott ; Austin, Todd ; Orshansky, Michael
  • Author_Institution
    Adv. Comput. Archit. Lab., Michigan Univ., Ann Arbor, MI, USA
  • fYear
    2006
  • fDate
    11-15 Feb. 2006
  • Firstpage
    5
  • Lastpage
    16
  • Abstract
    As silicon technologies move into the nanometer regime, transistor reliability is expected to wane as devices become subject to extreme process variation, particle-induced transient errors, and transistor wear-out. Unless these challenges are addressed, computer vendors can expect low yields and short mean-times-to-failure. In this paper, we examine the challenges of designing complex computing systems in the presence of transient and permanent faults. We select one small aspect of a typical chip multiprocessor (CMP) system to study in detail, a single CMP router switch. To start, we develop a unified model of faults, based on the time-tested bathtub curve. Using this convenient abstraction, we analyze the reliability versus area tradeoff across a wide spectrum of CMP switch designs, ranging from unprotected designs to fully protected designs with online repair and recovery capabilities. Protection is considered at multiple levels from the entire system down through arbitrary partitions of the design. To better understand the impact of these faults, we evaluate our CMP switch designs using circuit-level timing on detailed physical layouts. Our experimental results are quite illuminating. We find that designs are attainable that can tolerate a larger number of defects with less overhead than naive triple-modular redundancy, using domain-specific techniques such as end-to-end error detection, resource sparing, automatic circuit decomposition, and iterative diagnosis and reconfiguration.
  • Keywords
    computer architecture; fault tolerance; logic design; microprocessor chips; multiprocessing systems; network routing; BulletProof; CMP router switch; automatic circuit decomposition; chip multiprocessor system; circuit-level timing; complex computing systems design; defect-tolerant CMP switch architecture; end-to-end error detection; iterative diagnosis; iterative reconfiguration; online recovery; online repair; permanent faults; reliability; resource sparing; silicon technologies; time-tested bathtub curve; transient faults; Circuit faults; Computer architecture; Computer errors; Nanoscale devices; Protection; Redundancy; Silicon; Switches; Switching circuits; Timing
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    High-Performance Computer Architecture, 2006. The Twelfth International Symposium on
  • ISSN
    1530-0897
  • Print_ISBN
    0-7803-9368-6
  • Type
    conf
  • DOI
    10.1109/HPCA.2006.1598108
  • Filename
    1598108