  • DocumentCode
    3145574
  • Title
    GPU Accelerating for Rapid Multi-core Cache Simulation

  • Author
    Han, Wan ; Xiang, Long ; Xiaopeng, Gao ; Yi, Li

  • Author_Institution
    State Key Lab. of Virtual Reality Technol. & Syst., Beihang Univ., Beijing, China
  • fYear
    2011
  • fDate
    16-20 May 2011
  • Firstpage
    1387
  • Lastpage
    1396
  • Abstract
    To find the best memory system for emerging workloads, traces are collected during an application's execution, and caches with different configurations are then simulated using these traces. Since program traces can be several gigabytes in size, cache performance simulation is a time-consuming process. Compute unified device architecture (CUDA) is a software development platform that enables programmers to accelerate general-purpose applications on the graphics processing unit (GPU). This paper presents a real-time multi-core cache simulator, built on the Pin tool to capture memory references, together with a fast method for multi-core cache simulation on a CUDA-enabled GPU. The proposed method is accelerated by the following techniques: execution parallelism exploration, memory latency hiding, and a novel trace compression methodology. We describe how these techniques can be incorporated into CUDA code. Experimental results show that the hybrid parallel method presented here, which combines time-partitioning with set-partitioning, achieves an 11.10× speedup over the serial CPU simulation algorithm. The simulator can characterize the cache performance of single-threaded or multi-threaded workloads at speeds of 6-15 MIPS, and it can simulate six cache configurations in a single pass at these speeds, whereas CMP$im can simulate only one cache configuration per simulation pass at speeds of 4-10 MIPS.
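    The set-partitioning idea mentioned in the abstract — splitting a memory trace by cache set index so that each set becomes an independent sub-simulation that can run in parallel — can be sketched in plain Python. This is a minimal illustration, not the paper's method: the authors' implementation is in CUDA, and the cache parameters, function names, and LRU policy below are illustrative assumptions.

    ```python
    from collections import OrderedDict

    # Hypothetical cache parameters; the paper's actual configurations are not given here.
    LINE_SIZE = 64      # bytes per cache line
    NUM_SETS = 4        # number of sets in the cache
    ASSOC = 2           # ways (lines) per set

    def simulate_set(tags):
        """Simulate one cache set with LRU replacement; return its miss count."""
        lru = OrderedDict()   # tag -> None, ordered oldest-first
        misses = 0
        for tag in tags:
            if tag in lru:
                lru.move_to_end(tag)          # hit: refresh LRU position
            else:
                misses += 1
                if len(lru) >= ASSOC:
                    lru.popitem(last=False)   # evict least recently used tag
                lru[tag] = None
        return misses

    def simulate_trace(trace):
        # Set-partitioning: split the address trace by set index. Accesses to
        # different sets never interact, so each partition is an independent
        # sub-simulation -- the property a GPU version exploits by assigning
        # sets to parallel threads.
        partitions = [[] for _ in range(NUM_SETS)]
        for addr in trace:
            line = addr // LINE_SIZE
            partitions[line % NUM_SETS].append(line // NUM_SETS)  # keep the tag
        return sum(simulate_set(p) for p in partitions)
    ```

    In a serial run the per-set results are simply summed; the parallel scheme replaces that loop with one worker per set (or per group of sets), which is why partitioning the trace up front is the key step.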
  • Keywords
    cache storage; computer architecture; computer graphic equipment; coprocessors; multiprocessing systems; parallel processing; CUDA; GPU; Pin tool; compute unified device architecture; execution parallelism exploration; graphics processing unit; hybrid parallel method; memory latency hiding; memory system; multicore cache simulation; program traces; set-partitioning; time-partitioning; trace compression methodology; Computational modeling; Graphics processing unit; Instruction sets; Instruments; Parallel processing; Partitioning algorithms;
  • fLanguage
    English
  • Publisher
    ieee
  • Conference_Titel
    2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and PhD Forum (IPDPSW)
  • Conference_Location
    Shanghai
  • ISSN
    1530-2075
  • Print_ISBN
    978-1-61284-425-1
  • Electronic_ISBN
    1530-2075
  • Type
    conf
  • DOI
    10.1109/IPDPS.2011.295
  • Filename
    6008993