Author_Institution :
Sch. of Comput. Sci. & Technol., Beijing Inst. of Technol., Beijing, China
Abstract :
Data prefetching speculatively issues memory requests for data needed later by the main computation, and can therefore increase stress on the limited resources of chip multiprocessors. If not used properly, it can cause harmful effects such as cache pollution and wasted bandwidth. Accurate, fine-grained measurement of the related runtime metrics is therefore an important first step in reducing harmful prefetches and increasing memory-level parallelism (MLP) on chip multiprocessors. However, the required measurement is impractical on real machines, because it introduces nontrivial performance overhead and thus leads to inaccurate results. In this paper, we use cycle-accurate full-system simulation to study the memory system performance of our previously proposed data prefetching technique for chip multiprocessors that controls harmful prefetches: software-initiated inter-core LLC prepushing. We modified the GEMS multiprocessor simulator to support trace-based measurement and offline analysis of MLP, DRAM bank-level parallelism (BLP), and their relationship with software-initiated inter-core LLC prepushing. Results show that prepushing achieves speedups of 1.628, 1.019 and 1.032 in mst, em3d and 429.mcf, respectively. Average L2 MLP changes by +26%, +0.3% and -1% in mst, em3d and 429.mcf, respectively.
Keywords :
DRAM chips; microprocessor chips; storage management; DRAM BLP; GEMS multiprocessor simulator; MLP; cycle accurate full-system simulation; data prefetching; memory level parallelism; memory system performance evaluation; on chip multiprocessors; software-initiated intercore LLC Prepushing; Measurement; Multicore processing; Object oriented modeling; Prefetching; Random access memory; System performance; architectural simulation; chip multiprocessors; memory system performance