• DocumentCode
    1778180
  • Title
    Online error detection and recovery in dataflow execution
  • Author
    Alves, Tiago A. O. ; Kundu, Sandipan ; Marzulo, Leandro A. J. ; Franca, Felipe M. G.
  • Author_Institution
    Programa de Eng. de Sist. e Comput., COPPE Univ. Fed. do Rio de Janeiro, Rio de Janeiro, Brazil
  • fYear
    2014
  • fDate
    7-9 July 2014
  • Firstpage
    9
  • Lastpage
    104
  • Abstract
    The processor industry is well on its way towards manycore processors that comprise a large number of simple cores. The shift towards multicore and manycore processors calls for new programming paradigms suitable for exploiting the inherent parallelism in applications. Dataflow execution has been shown to be a good option for programming in such environments. It is well known that as CMOS technology continues to scale, it becomes more prone to transient and permanent hardware faults. In this paper we present a novel mechanism for error detection and recovery that focuses on transient errors in dataflow execution. Due to the inherently parallel nature of dataflow, our solution is completely distributed and synchronizes only cores that have data dependencies between them, as opposed to prior work on error recovery, which in general relies on global synchronization of the system. We evaluate the proposed solution via a software implementation on top of a dataflow runtime. Experimental results show that error detection overhead is highly related to the pressure on the memory bus. In memory-bound applications, performance is found to deteriorate, while for other benchmarks the observed overhead is less than 23%. We find no comparable previous work against which to contrast these results.
  • Keywords
    CMOS integrated circuits; circuit analysis computing; data flow computing; error detection; integrated circuit reliability; synchronisation; CMOS technology; data dependencies; dataflow execution; global synchronization; manycore processors; memory bus; online error detection; online error recovery; permanent hardware faults; processor industry; transient errors; transient faults; Benchmark testing; Bit error rate; Hardware; Image edge detection; Parallel processing; Runtime; Transient analysis;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    On-Line Testing Symposium (IOLTS), 2014 IEEE 20th International
  • Conference_Location
    Platja d'Aro, Girona
  • Type
    conf
  • DOI
    10.1109/IOLTS.2014.6873679
  • Filename
    6873679