• DocumentCode
    549545
  • Title
    DRAIN: Distributed Recovery Architecture for Inaccessible Nodes in multi-core chips
  • Author
    DeOrio, Andrew ; Aisopos, Konstantinos ; Bertacco, Valeria ; Peh, Li-Shiuan
  • Author_Institution
    Univ. of Michigan, Ann Arbor, MI, USA
  • fYear
    2011
  • fDate
    5-9 June 2011
  • Firstpage
    912
  • Lastpage
    917
  • Abstract
    As transistor dimensions continue to scale deep into the nanometer regime, silicon reliability is becoming a chief concern. At the same time, transistor counts are scaling up, enabling the design of highly integrated chips with many cores and a complex interconnect fabric, often a network on chip (NoC). Particularly problematic is the case when the accumulation of permanent hardware faults leads to disconnected cores in the system. In order to maintain correct system operation, it is necessary to salvage the data from these isolated nodes. In this work, we introduce a recovery mechanism targeting precisely this issue: DRAIN (Distributed Recovery Architecture for Inaccessible Nodes) provides system-level recovery from permanent failures. When an error disconnects a node from the network, DRAIN uses emergency links to transfer architectural state and cached data from disconnected nodes to nearby connected caches. DRAIN incurs zero performance penalty during normal operation, and is compatible with any cache coherence protocol, interconnect topology, or routing protocol. Experimental results show that DRAIN is able to provide complete state recovery within several milliseconds, on average, for a 1 GHz 64-node CMP at an area overhead of only a few thousand gates.
  • Keywords
    computer architecture; fault tolerant computing; multiprocessing systems; network-on-chip; 1GHz 64-node CMP; DRAIN; Distributed Recovery Architecture for Inaccessible Nodes; cache coherence protocol; distributed recovery architecture; multicore chip; network on chip; recovery mechanism; routing protocol; silicon reliability; system level recovery; transfer architectural state; zero performance penalty; Hardware; Logic gates; Network topology; Radiation detectors; Reliability; Routing; Topology; Fault-Tolerance; Network-on-Chip; Recovery; Resilient Systems;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Design Automation Conference (DAC), 2011 48th ACM/EDAC/IEEE
  • Conference_Location
    New York, NY
  • ISSN
    0738-100X
  • Print_ISBN
    978-1-4503-0636-2
  • Type
    conf
  • Filename
    5981885