Title :
Investigating resilient high performance reconfigurable computing with minimally-invasive system monitoring
Author :
Huang, Bin ; Schmidt, Andrew G. ; Mendon, Ashwin A. ; Sass, Ron
Author_Institution :
Dept. of Electr. & Comput. Eng., Univ. of North Carolina at Charlotte, Charlotte, NC, USA
Abstract :
As researchers push for Exascale computing, one of the emerging challenges is system resilience. Unlike fault tolerance, which corrects errors, resilience, recent reports suggest, requires that a system continue to make progress on an application despite faults. A first step in developing a resilient system is to have robust, scalable system monitoring. The work described here presents a novel, minimally-invasive system monitor that operates over a separate network. We analytically characterize the performance for an arbitrary set of nodes and demonstrate a working implementation of the design. We argue that the hardware approach is inherently superior to the ad hoc software techniques currently employed in practice.
Keywords :
invasive software; software fault tolerance; system monitoring; Exascale computing; fault tolerance; minimally-invasive system monitoring; resilient high performance reconfigurable computing; resilient system; scalable system monitoring; software technique; Biomedical monitoring; Field programmable gate arrays; Hardware; Magnetic heads; Monitoring; Resilience; Software;
Conference_Title :
High-Performance Reconfigurable Computing Technology and Applications (HPRCTA), 2010 Fourth International Workshop on
Conference_Location :
New Orleans, LA
Print_ISBN :
978-1-4244-9516-0
Electronic_ISBN :
2150-7945
DOI :
10.1109/HPRCTA.2010.5670795