Using Multiple Threads to Accelerate Single Thread Performance

Author

Sura, Zehra ; O'Brien, Kevin ; Brunheroto, Jose

Author_Institution

IBM T.J. Watson Res. Center, Yorktown Heights, NY, USA

fYear

2014

fDate

19-23 May 2014

Firstpage

985

Lastpage

994

Abstract

Computing systems are being designed with an increasing number of hardware cores. To effectively use these cores, applications need to maximize the amount of parallel processing and minimize the time spent in sequential execution. In this work, we aim to exploit fine-grained parallelism beyond the parallelism already encoded in an application. We define an execution model using a primary core and some number of secondary cores that collaborate to speed up the execution of sequential code regions. This execution model relies on cores that are physically close to each other and have fast communication paths between them. For this purpose, we introduce dedicated hardware queues for low-latency transfer of values between cores, and define special "enque" and "deque" instructions to use the queues. Further, we develop compiler analyses and transformations to automatically derive fine-grained parallel code from sequential code regions. We implemented this model for exploiting fine-grained parallelization in the IBM XL compiler framework and in a simulator for the Blue Gene/Q system. We also studied the Sequoia benchmarks to determine code sections where our techniques are applicable. We evaluated our work using these code sections, and observed an average speedup of 1.32 on 2 cores, and an average speedup of 2.05 on 4 cores. Since these code sections are otherwise sequentially executed, we conclude that our approach is useful for accelerating single thread performance.
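
As a rough illustration of the execution model described in the abstract, the sketch below splits one sequential loop between a primary and a secondary thread that exchange values through a queue. It is a software analogy only: Python's `queue.Queue` stands in for the dedicated hardware queues, `put` and `get` stand in for the "enque" and "deque" instructions, and the loop body and the function names (`sequential`, `split_across_two_threads`) are hypothetical, not taken from the paper. The partitioning here is done by hand, whereas the paper derives it automatically in the compiler.

```python
import queue
import threading


def sequential(data):
    """Hypothetical sequential region: two independent computations per iteration."""
    total = 0
    for x in data:
        a = x * x + 1      # part A
        b = 3 * x - 2      # part B, independent of part A
        total += a * b
    return total


def split_across_two_threads(data):
    """Same region, with part B moved to a secondary thread.

    The queue is a software stand-in (an assumption of this sketch) for a
    hardware queue between two neighbouring cores.
    """
    to_primary = queue.Queue()

    def secondary():
        for x in data:
            to_primary.put(3 * x - 2)   # plays the role of "enque"

    worker = threading.Thread(target=secondary)
    worker.start()

    total = 0
    for x in data:
        a = x * x + 1                   # primary keeps part A
        b = to_primary.get()            # plays the role of "deque"; blocks until a value arrives
        total += a * b

    worker.join()
    return total


if __name__ == "__main__":
    values = list(range(1000))
    print(sequential(values) == split_across_two_threads(values))
```

The queue is first-in, first-out, so the primary thread receives the values in iteration order and both versions compute the same result. This sketch shows only the dataflow: in CPython the two threads do not run this arithmetic in parallel, and a software queue costs far more per transfer than the low-latency hardware path the abstract relies on.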

Keywords

multi-threading; parallelising compilers; program diagnostics; software performance evaluation; Blue Gene/Q system; IBM XL compiler framework; automatic fine-grained parallel code generation; code sections; compiler analysis; computing systems; deque instructions; enque instructions; execution model; fine-grained parallelism; hardware queues; low-latency value transfer; multithreading; parallel processing; sequential code region execution; sequential execution; single thread performance acceleration; time spent minimization; Acceleration; Benchmark testing; Hardware; Instruction sets; Parallel processing; Partitioning algorithms; Registers;

fLanguage

English

Publisher

IEEE

Conference_Titel

Parallel and Distributed Processing Symposium, 2014 IEEE 28th International

Conference_Location

Phoenix, AZ

ISSN

1530-2075

Print_ISBN

978-1-4799-3799-8

Type

conf

DOI

10.1109/IPDPS.2014.104

Filename

6877328