• DocumentCode
    3271569
  • Title

    Practical formal correctness checking of million-core problem solving environments for HPC

  • Author

    de Oliveira, Diego Caminha B. ; Rakamaric, Zvonimir ; Gopalakrishnan, Ganesh ; Humphrey, Alan ; Qingyu Meng ; Berzins, Martin

  • Author_Institution
    Sch. of Comput., Univ. of Utah, Salt Lake City, UT, USA
  • fYear
    2013
  • fDate
    18-18 May 2013
  • Firstpage
    75
  • Lastpage
    83
  • Abstract
    While formal correctness checking methods have been deployed at scale in a number of important practical domains, we believe that such an experiment has yet to occur in the domain of high performance computing at the scale of a million CPU cores. This paper presents preliminary results from the Uintah Runtime Verification (URV) project that has been launched with this objective. Uintah is an asynchronous task-graph based problem-solving environment that has shown promising results on problems as diverse as fluid-structure interaction and turbulent combustion at well over 200K cores to date. Uintah has been tested on leading platforms such as Kraken, Keenland, and Titan consisting of multicore CPUs and GPUs, incorporates several innovative design features, and is following a roadmap for development well into the million core regime. The main results from the URV project to date are crystallized in two observations: (1) A diverse array of well-known ideas from lightweight formal methods and testing/observing HPC systems at scale have an excellent chance of succeeding. The real challenges are in finding out exactly which combinations of ideas to deploy, and where. (2) Large-scale problem solving environments for HPC must be designed such that they can be “crashed early” (at smaller scales of deployment) and “crashed often” (have effective ways of input generation and schedule perturbation that cause vulnerabilities to be attacked with higher probability). Furthermore, following each crash, one must “explain well” (given the extremely obscure ways in which an error finally manifests itself, we must develop ways to record information leading up to the crash in informative ways, to minimize offsite debugging burden). Our plans to achieve these goals and to measure our success are described. We also highlight some of the broadly applicable concepts and approaches.
  • Keywords
    chemically reactive flow; combustion; crystallisation; formal verification; graph theory; multiprocessing systems; parallel processing; problem solving; task analysis; turbulence; GPU; HPC systems testing; URV project; Uintah runtime verification; asynchronous task graph based problem solving environment; fluid structure interaction; formal correctness checking method; formal method; high performance computing; innovative design feature; million CPU core; million core problem solving environment; multicore CPU; turbulent combustion; Computer bugs; Graphics processing units; Problem-solving; Runtime; Schedules; Testing;
  • fLanguage
    English
  • Publisher
    ieee
  • Conference_Titel
    Software Engineering for Computational Science and Engineering (SE-CSE), 2013 5th International Workshop on
  • Conference_Location
    San Francisco, CA
  • Type

    conf

  • DOI
    10.1109/SECSE.2013.6615102
  • Filename
    6615102