Regional Information Center for Science and Technology - Address Translation Optimization for Unified Parallel C Multi-dimensional Arrays

DocumentCode :

3145010

Title :

Address Translation Optimization for Unified Parallel C Multi-dimensional Arrays

Author :

Serres, Olivier ; Anbar, Ahmad ; Merchant, Saumil G. ; Kayi, Abdullah ; El-Ghazawi, Tarek

Author_Institution :

Dept. of Electr. & Comput. Eng., George Washington Univ., Washington, DC, USA

fYear :

2011

fDate :

16-20 May 2011

Firstpage :

1191

Lastpage :

1198

Abstract :

Partitioned Global Address Space (PGAS) languages offer significant programmability advantages with their global memory view abstraction, one-sided communication constructs and data locality awareness. These attributes place PGAS languages at the forefront of possible solutions to the exploding programming complexity of many-core architectures. To enable the shared address space abstraction, PGAS languages use an address translation mechanism that converts shared addresses to physical addresses on each shared memory access. This mechanism is already expensive in terms of performance in distributed memory environments, but it becomes a major bottleneck in machines with shared memory support, where access latencies are significantly lower. Multi- and many-core processors exhibit even lower latencies for shared data due to on-chip cache space utilization. Thus, efficient handling of address translation becomes even more crucial, as this overhead may easily become the dominant factor in the overall data access time for such architectures. To alleviate address translation overhead, this paper introduces a new mechanism targeting multi-dimensional arrays, which are used in most scientific and image processing applications. Relative costs and the implementation details for UPC are evaluated with different workloads (matrix multiplication, the Random Access benchmark and Sobel edge detection) on two different platforms: a many-core system, the TILE64 (a 64-core processor), and a dual-socket, quad-core Intel Nehalem system (up to 16 threads). Our optimization provides substantial performance improvements, up to 40x. In addition, the proposed mechanism can easily be integrated into compilers, hiding it from programmers. Accordingly, it improves UPC productivity by reducing the manual optimization effort required to minimize the address translation overhead.

Keywords :

C language; cache storage; distributed memory systems; distributed shared memory systems; parallel architectures; parallel programming; parallelising compilers; storage allocation; PGAS language; Random Access benchmark; Sobel edge detection; TILE64; access latency; address translation mechanism; address translation optimization; address translation overhead; data access; data locality awareness; distributed memory environment; dual-socket quad-core Intel Nehalem system; global memory view abstraction; many-core architecture; many-core processors; matrix multiplication; on-chip cache space utilization; one-sided communication construct; parallel programming; partitioned global address space language; physical address; programming complexity; shared address space abstraction; shared data; shared memory access; shared memory support; unified parallel C multidimensional array; Arrays; Electronics packaging; Instruction sets; Optimization; Table lookup; Tiles;

fLanguage :

English

Publisher :

IEEE

Conference_Title :

Parallel and Distributed Processing Workshops and PhD Forum (IPDPSW), 2011 IEEE International Symposium on

Conference_Location :

Shanghai

ISSN :

1530-2075

Print_ISBN :

978-1-61284-425-1

Electronic_ISBN :

1530-2075

Type :

conf

DOI :

10.1109/IPDPS.2011.279

Filename :

6008969

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=3145010