DocumentCode :
3271569
Title :
Practical formal correctness checking of million-core problem solving environments for HPC
Author :
de Oliveira, Diego Caminha B. ; Rakamaric, Zvonimir ; Gopalakrishnan, Ganesh ; Humphrey, Alan ; Qingyu Meng ; Berzins, Martin
Author_Institution :
Sch. of Comput., Univ. of Utah, Salt Lake City, UT, USA
fYear :
2013
fDate :
18-18 May 2013
Firstpage :
75
Lastpage :
83
Abstract :
While formal correctness checking methods have been deployed at scale in a number of important practical domains, we believe that such an experiment has yet to occur in the domain of high performance computing at the scale of a million CPU cores. This paper presents preliminary results from the Uintah Runtime Verification (URV) project that has been launched with this objective. Uintah is an asynchronous task-graph based problem-solving environment that has shown promising results on problems as diverse as fluid-structure interaction and turbulent combustion at well over 200K cores to date. Uintah has been tested on leading platforms such as Kraken, Keenland, and Titan consisting of multicore CPUs and GPUs, incorporates several innovative design features, and is following a roadmap for development well into the million core regime. The main results from the URV project to date are crystallized in two observations: (1) A diverse array of well-known ideas from lightweight formal methods and testing/observing HPC systems at scale have an excellent chance of succeeding. The real challenges are in finding out exactly which combinations of ideas to deploy, and where. (2) Large-scale problem solving environments for HPC must be designed such that they can be “crashed early” (at smaller scales of deployment) and “crashed often” (have effective ways of input generation and schedule perturbation that cause vulnerabilities to be attacked with higher probability). Furthermore, following each crash, one must “explain well” (given the extremely obscure ways in which an error finally manifests itself, we must develop ways to record information leading up to the crash in informative ways, to minimize offsite debugging burden). Our plans to achieve these goals and to measure our success are described. We also highlight some of the broadly applicable concepts and approaches.
Keywords :
chemically reactive flow; combustion; crystallisation; formal verification; graph theory; multiprocessing systems; parallel processing; problem solving; task analysis; turbulence; GPU; HPC systems testing; URV project; Uintah runtime verification; asynchronous task graph based problem solving environment; fluid structure interaction; formal correctness checking method; formal method; high performance computing; innovative design feature; million CPU core; million core problem solving environment; multicore CPU; turbulent combustion; Computer bugs; Graphics processing units; Problem-solving; Runtime; Schedules; Testing;
fLanguage :
English
Publisher :
ieee
Conference_Titel :
Software Engineering for Computational Science and Engineering (SE-CSE), 2013 5th International Workshop on
Conference_Location :
San Francisco, CA
Type :
conf
DOI :
10.1109/SECSE.2013.6615102
Filename :
6615102
Link To Document :
بازگشت