• DocumentCode
    3183743
  • Title

    Checkpoint/Restart and Beyond: Resilient High Performance Computing with FPGAs

  • Author

    Schmidt, Andrew G. ; Huang, Bin ; Sass, Ron ; French, Matthew

  • Author_Institution
    Reconfigurable Comput. Syst. Lab., Univ. of North Carolina at Charlotte, Charlotte, NC, USA
  • fYear
    2011
  • fDate
    1-3 May 2011
  • Firstpage
    162
  • Lastpage
    169
  • Abstract
    As FPGA resources continue to increase, FPGAs present attractive features to the High Performance Computing community. These include power-efficient computation and application-specific acceleration, as well as tighter integration between compute and I/O resources. This paper considers the ability of an FPGA to address another increasingly important feature: resiliency. Specifically, a minimally-invasive monitoring infrastructure operating over a sideband network is presented. This includes a multi-chip protocol, IP cores that implement the protocol, and a tool to instrument existing hardware accelerator FPGA designs. To demonstrate the functionality, the system has been implemented on a cluster of FPGA devices running off-the-shelf MPI and Linux. We demonstrate the ability to do integrated software and hardware accelerator checkpointing with restart under a variety of injected faults.
  • Keywords
    IP networks; Linux; application program interfaces; checkpointing; field programmable gate arrays; integrated software; message passing; parallel machines; power aware computing; FPGA device; I/O resource; IP core; Linux; application specific acceleration benefit; hardware accelerator FPGA design; hardware accelerator checkpointing; integrated software; minimally-invasive monitoring infrastructure; multichip protocol; off-the-shelf MPI; power efficient computation; resilient high performance computing; sideband network; Amplitude modulation; Context; Field programmable gate arrays; Hardware; Monitoring; Registers; Software; Checkpoint Restart; FPGA; High Performance Computing; Reconfigurable Computing; Resiliency;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Field-Programmable Custom Computing Machines (FCCM), 2011 IEEE 19th Annual International Symposium on
  • Conference_Location
    Salt Lake City, UT
  • Print_ISBN
    978-1-61284-277-6
  • Electronic_ISBN
    978-0-7695-4301-7
  • Type

    conf

  • DOI
    10.1109/FCCM.2011.22
  • Filename
    5771268