Title :
Be Kind, Rewind: Checkpoint & Restore Capability for Improving Reliability of Large-Scale Semiconductor Design
Author :
Ljubuncic, Igor ; Rozenfeld, Avikam ; Goldis, Andrew ; Giri, Ravi
Author_Institution :
Intel Corp., Petach-Tikva, Israel
Abstract :
Intel's chip design runs in a large-scale, globally distributed environment with 600,000 cores. In the current semiconductor market, a combination of factors such as time-to-market pressure, explosive growth in the mobile market segment, and upcoming new markets has led to a significant increase in the demand for computing resources and in the reliability required of them. Checkpointing can significantly improve reliability; however, there is no mature solution that takes periodic snapshots of running compute jobs and replays them at a later time in a consistent manner in a large-scale environment. Intel IT has partnered with the Northeastern University (NEU) Distributed Multi-Threaded Checkpointing (DMTCP) team to improve their checkpoint & restore solution for the design computing environment. This paper elaborates on the technological breakthroughs, the industry-academy partnership, and the open-source contribution.
Keywords :
checkpointing; integrated circuit design; microprocessor chips; semiconductor device reliability; DMTCP; Intel chip design; Northeastern University; distributed multithreaded checkpointing; industry-academy partnership; large scale environment; large-scale semiconductor design; mobile market segment; open-source contribution; Computational modeling; Computer architecture; Image restoration; Kernel; Reliability; CPU design; Checkpoint & Restore; Engineering Computing; Information Technology; Intel
Conference_Title :
Intelligent Networking and Collaborative Systems (INCoS), 2014 International Conference on
Print_ISBN :
978-1-4799-6386-7
DOI :
10.1109/INCoS.2014.90