Title :
A dynamic schema to increase performance in many-core architectures through percolation operations
Author :
Garcia, Eloy ; Orozco, Daniel ; Khan, Raees ; Venetisz, Ioannis E. ; Livingston, Kelly ; Gao, Guang R.
Author_Institution :
Electr. & Comput. Eng. Dept., Univ. of Delaware, Newark, DE, USA
Abstract :
Optimization of parallel applications under new many-core architectures is challenging even for regular applications. Successful strategies inherited from previous generations of parallel or serial architectures yield only incremental gains in performance, and further optimization and tuning are required. We argue that conservative static optimizations are not the best fit for modern many-core architectures. The limited advantages of static techniques come from the new scenarios present in many-cores: plenty of thread units sharing several resources under different coordination mechanisms. We point out that scheduling and data movement across the memory hierarchy are extremely important to the performance of applications. In particular, we found that the scheduling of data movement operations significantly impacts performance. To overcome those difficulties, we took advantage of the fine-grain synchronization primitives of many-cores to define percolation operations in order to schedule data movement properly. In addition, we have fused percolation operations with dynamic scheduling into a dynamic percolation approach. We used Dense Matrix Multiplication on a modern many-core to illustrate how our proposed techniques are able to increase performance in these new environments. In our study on the IBM Cyclops-64, we raised the performance from 44 GFLOPS (out of 80 GFLOPS possible) to 70.0 GFLOPS (operands in on-chip memory) and 65.6 GFLOPS (operands in off-chip memory). The success of our approach also resulted in excellent power efficiency: 1.09 GFLOPS/Watt and 993 MFLOPS/Watt when the input data resided in on-chip and off-chip memory, respectively.
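The figures quoted in the abstract can be put on a common scale with a little arithmetic. The sketch below is illustrative only and is not taken from the paper: it computes each result as a fraction of the 80 GFLOPS peak (55%, 87.5%, and 82%), and it derives an implied chip power of roughly 64 W and 66 W under the assumption that the reported GFLOPS/Watt values were measured on the same runs as the reported GFLOPS values. The function and variable names are hypothetical.

```python
# Illustrative arithmetic on the numbers quoted in the abstract.
# Names are hypothetical; only the numeric inputs come from the abstract.

PEAK_GFLOPS = 80.0  # "out of 80 GFLOPS possible"

# Sustained performance reported in the abstract, in GFLOPS.
reported_gflops = {
    "baseline": 44.0,
    "on-chip operands": 70.0,
    "off-chip operands": 65.6,
}

# Power efficiency reported in the abstract, in GFLOPS per Watt.
reported_gflops_per_watt = {
    "on-chip operands": 1.09,
    "off-chip operands": 0.993,  # 993 MFLOPS/Watt
}


def fraction_of_peak(gflops, peak=PEAK_GFLOPS):
    """Return sustained performance as a fraction of theoretical peak."""
    return gflops / peak


def implied_power_watts(gflops, gflops_per_watt):
    """Return the power implied by a performance and an efficiency figure.

    Assumption: both figures describe the same run, which the abstract
    suggests but does not state explicitly.
    """
    return gflops / gflops_per_watt


if __name__ == "__main__":
    for name, gflops in reported_gflops.items():
        print(f"{name}: {100 * fraction_of_peak(gflops):.1f}% of peak")
    for name, efficiency in reported_gflops_per_watt.items():
        watts = implied_power_watts(reported_gflops[name], efficiency)
        print(f"{name}: about {watts:.0f} W implied")
```

Read this way, the dynamic percolation approach closes most of the gap to peak, and the off-chip case gives up only a few percentage points relative to the on-chip case.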
Keywords :
dynamic scheduling; matrix multiplication; multiprocessing systems; optimisation; parallel architectures; synchronisation; IBM Cyclops-64; data movement operation scheduling; dense matrix multiplication; dynamic percolation approach; fine-grain synchronization; many-core architectures; memory hierarchy; off-chip memory; on-chip memory; parallel application optimization; percolation operations; performance optimization; power efficiency; resource sharing; serial architectures; Computer architecture; Optimization; Program processors; Random access memory; Registers; System-on-chip; Tiles;
Conference_Title :
High Performance Computing (HiPC), 2013 20th International Conference on
Conference_Location :
Bangalore
DOI :
10.1109/HiPC.2013.6799134