Preliminary Experiments with XKaapi on Intel Xeon Phi Coprocessor

Author

Lima, Joao V. F. ; Broquedis, Francois ; Gautier, Thierry ; Raffin, Bruno

Author_Institution

Grenoble Inst. of Technol., Grenoble, France

fYear

2013

fDate

23-26 Oct. 2013

Firstpage

105

Lastpage

112

Abstract

This paper presents preliminary performance comparisons of parallel applications developed natively for the Intel Xeon Phi accelerator using three different parallel programming environments and their associated runtime systems. We compare Intel OpenMP, Intel CilkPlus and XKaapi on the same benchmark suite, and we provide comparisons between an Intel Xeon Phi coprocessor and a Sandy Bridge Xeon-based machine. Our benchmark suite is composed of three computing kernels: a Fibonacci computation that lets us study the overhead and the scalability of the runtime system, an NQueens application generating irregular and dynamic tasks, and a Cholesky factorization algorithm. We also compare the Cholesky factorization with the parallel algorithm provided by the Intel MKL library for Intel Xeon Phi. Performance evaluation shows that our XKaapi data-flow parallel programming environment incurs the lowest overhead of all and is highly competitive with the native OpenMP and CilkPlus environments on Xeon Phi. Moreover, the efficient handling of data-flow dependencies between tasks lets our XKaapi environment expose more parallelism for some applications, such as the Cholesky factorization. In that case, we observe substantial gains with up to 180 hardware threads over the state-of-the-art MKL, with a 47% performance increase for 60 hardware threads.
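
To illustrate why a recursive Fibonacci kernel is a natural probe of runtime overhead, here is a minimal sketch. It is not the paper's benchmark code: it is a sequential Python stand-in that assumes the usual formulation in which every recursive call would be spawned as one task, and the function name `fib_with_task_count` is hypothetical. It counts how many such tasks a single run would create.

```python
def fib_with_task_count(n):
    """Return (fib(n), calls), where each recursive call stands in for one task.

    Assumption: in a task-parallel version, every call below would be
    spawned as a separate task by the runtime under test.
    """
    if n < 2:
        # Leaf task: no further spawning, essentially zero useful work.
        return n, 1
    a, tasks_a = fib_with_task_count(n - 1)
    b, tasks_b = fib_with_task_count(n - 2)
    # One addition of real work per non-leaf task; +1 counts this call.
    return a + b, tasks_a + tasks_b + 1


if __name__ == "__main__":
    value, tasks = fib_with_task_count(20)
    print(value, tasks)
```

Each task performs at most one addition, so under this assumption the running time of a parallel version is dominated by task creation and scheduling rather than by computation, and the task count grows as 2*fib(n+1) - 1.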

Keywords

Fibonacci sequences; benchmark testing; coprocessors; data flow computing; matrix decomposition; message passing; multi-threading; parallel algorithms; Cholesky factorization algorithm; Fibonacci computation; Intel CilkPlus; Intel MKL library; Intel OpenMP; Intel Xeon Phi Coprocessor; Intel Xeon Phi accelerator; NQueens application; Sandy Bridge Xeon-based machine; XKaapi data-flow parallel programming environment; benchmark suite; computing kernels; data-flow dependency handling; dynamic tasks; hardware thread; irregular tasks; parallel algorithm; parallel applications; performance evaluation; runtime system overhead; runtime system scalability; Benchmark testing; Computer architecture; Coprocessors; Hardware; Instruction sets; Parallel programming; Runtime; Intel Xeon Phi; accelerators; data-flow programming; runtime systems; work stealing;

fLanguage

English

Publisher

IEEE

Conference_Titel

Computer Architecture and High Performance Computing (SBAC-PAD), 2013 25th International Symposium on

Conference_Location

Porto de Galinhas

Print_ISBN

978-1-4799-2927-6

Type

conf

DOI

10.1109/SBAC-PAD.2013.28

Filename

6702586