Author_Institution :
Univ. of Delaware, Newark, DE, USA
Abstract :
Summary form only given. The popularity of serial execution paradigms in the High Performance Computing (HPC) field greatly hinders the ability of computational scientists to develop and support massively parallel programs. Programmers are left with languages that are inadequate to express parallel constructs, forcing them to make decisions that are not directly related to the programs they write. Computer architects must support sequential memory semantics only because serial languages require them, and operating system designers must support slow synchronization operations. This poster addresses the development and execution of HPC programs on many-core architectures by introducing the Time Iterated Dependency Flow (TIDeFlow) execution model. In TIDeFlow, programmers specify the precedence relations between computations without dealing with implementation details related to synchronization or scheduling. TIDeFlow is a graph-based model inspired by dataflow: computations in a program are expressed as actors whose dependencies are represented by arcs. TIDeFlow departs from other dataflow models in that (1) actors represent parallel loops, the basic building block of HPC programs, (2) arcs between actors represent dependencies of any kind (data, control, or other), and (3) arc weights allow delaying tokens to support pipelining. TIDeFlow is related to other dataflow models (an excellent survey can be found in [1]) in grouping several computations into a single actor, as in Macro Dataflow [2], and in allowing multiple concurrent executions of the same actor through coloring of tokens, as in Dynamic Dataflow [3]. The resulting TIDeFlow model expresses HPC programs as directed graphs with weighted nodes and weighted arcs that represent parallel loops and loop-carried dependencies, respectively. The model supports task pipelining, task migration, and distributed control.
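The weighted-graph semantics described above can be illustrated with a small sketch. The names and API below are hypothetical and do not reflect the actual TIDeFlow implementation: actors stand in for parallel loops (here executed sequentially, one iteration per firing, purely for illustration), and an arc of weight w lets the consumer complete w iterations before it first waits on its producer, modeling a loop-carried dependency of distance w and thus pipelined execution.

```python
# Hypothetical sketch of a TIDeFlow-style program graph. All names are
# illustrative assumptions, not the real TIDeFlow API.
from collections import defaultdict

class Actor:
    def __init__(self, name, body, iterations):
        self.name = name
        self.body = body            # function invoked once per iteration
        self.iterations = iterations
        self.inputs = []            # list of (producer, arc weight) pairs

class Graph:
    def __init__(self):
        self.actors = []

    def actor(self, name, body, iterations):
        a = Actor(name, body, iterations)
        self.actors.append(a)
        return a

    def arc(self, src, dst, weight=0):
        # Weight w: iteration i of dst may fire only after src has
        # completed iteration i - w (a loop-carried dependency of distance w).
        dst.inputs.append((src, weight))

    def run(self):
        # Naive sequential scheduler: repeatedly fire any actor whose
        # next iteration has all its dependencies satisfied. A real
        # runtime would fire the iterations of each actor in parallel.
        done = defaultdict(int)     # actor -> completed iteration count
        progress = True
        while progress:
            progress = False
            for a in self.actors:
                i = done[a]
                if i >= a.iterations:
                    continue
                if all(done[src] >= i + 1 - w for src, w in a.inputs):
                    a.body(i)
                    done[a] += 1
                    progress = True

# Two-stage pipeline: the consumer trails the producer by one iteration.
g = Graph()
log = []
produce = g.actor("produce", lambda i: log.append(("produce", i)), 3)
consume = g.actor("consume", lambda i: log.append(("consume", i)), 3)
g.arc(produce, consume, weight=1)
g.run()
```

With weight 0 the arc would force strictly alternating, fully ordered execution; the weight of 1 lets iteration i of the consumer overlap with iteration i of the producer, which is the pipelining effect the model attributes to arc weights.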
An implementation of TIDeFlow was developed for Cyclops-64 [4], a 160-core architecture by IBM. The implementation resulted in new, highly concurrent algorithms, such as the HT-Queue [4], and in the development of efficient representations of runtime system primitives such as polytasks [5]. The implementation is supported by a number of software tools, including a graph programming model, a parallel intermediate representation form, and a fully distributed runtime system. The effectiveness of TIDeFlow was tested using several HPC programs, including FDTD in 1 and 2 dimensions [6], Matrix Multiply, and FFT. In all cases, the programs were run on Cyclops-64, enabling detailed studies of scalability, performance, parallelism, and overhead. The results of the experiments, presented in [4] and [5], show that TIDeFlow can efficiently support very fine-grained execution due to its very low overhead and its distributed nature. The experiments also show excellent scalability, with close-to-linear scalability for 156 processors executing matrix multiply. The experiments also showed advantages for development: expressing dependencies using a graph was found to be easier than placing hand-coded synchronization constructs inside programs. The performance of the TIDeFlow runtime system was carefully measured, showing that it uses very few clock cycles to create, schedule, and terminate tasks. The runtime system is fully distributed and lock-free, allowing runtime operations to be insensitive to the load of the system. This poster introduces TIDeFlow by presenting (1) its graph programming model (weighted nodes and weighted arcs), (2) a description of composability in TIDeFlow, which allows programs to be used to build larger programs, (3) a brief description of the TIDeFlow runtime system, and (4) a summary of the excellent results obtained, both in scalability and overhead, for FDTD in 1 and 2 dimensions, Matrix Multiply, and FFT.
This work contributes to the state of the art by: (1) Presenting a new execution model,
Keywords :
data flow computing; data flow graphs; multiprocessing systems; parallel programming; program control structures; Cyclops-64; FFT; HT-Queue; TIDeFlow model; concurrent algorithms; dataflow models; directed graphs; distributed control; distributed runtime system; dynamic dataflow; fast Fourier transform; graph programming model; graph-based model; high performance computing programs; macro dataflow; many-core architectures; matrix multiply; operating system designers; parallel execution model; parallel intermediate representation; parallel loops; parallel programs; polytasks; runtime system primitives; sequential memory semantics; serial languages; software tools; synchronization operations; task migration; task pipelining; time iterated dependency flow execution model; Computational modeling; Computer architecture; Programming; Runtime; Scalability; Synchronization; Time domain analysis; Cyclops-64; Dataflow; Dynamic dataflow; Macro dataflow; Manycore; Parallel Runtime Systems; Parallel execution model; Task Management; Tasking framework;