DocumentCode :
3345077
Title :
Efficient implementation of QR decomposition on intel multi-core processors
Author :
Soliman, Mostafa I.
Author_Institution :
Electr. Eng. Dept., South Valley Univ., Aswan, Egypt
fYear :
2011
fDate :
27-28 Dec. 2011
Firstpage :
25
Lastpage :
30
Abstract :
This paper shows how to speed up the QR decomposition algorithm on Intel multi-core processors by exploiting explicit parallelism and the memory hierarchy. Streaming SIMD Extensions (SSE) and multithreaded computation on multiple cores are used to exploit data-level parallelism (DLP) and thread-level parallelism (TLP), respectively. In addition, the memory hierarchy is exploited by performing the QR computation on blocks of data, which reduces the impact of memory latency by reusing data already loaded into the cache memories. On a Core 2 Duo E7500 with two cores (2 physical/2 logical processors), a Core i5 M520 with two cores supporting Hyper-Threading Technology (2 physical/4 logical processors), and a Xeon E5410 with four cores (4 physical/4 logical processors), the average speedups of the multithreaded SIMD implementation of block QR decomposition over the non-parallel execution, on matrices from 1000×1000 up to 3000×3000 in steps of 100, are about 6.6, 9.6, and 11.3, respectively. For reasonably large matrices of size 2000×2000 (4000×4000), our experimental results show that Intel Streaming SIMD Extensions, multithreading, SIMD multithreading, matrix blocking, blocking SIMD, blocking multithreading, and blocking SIMD multithreading speed up QR decomposition on the Core 2 Duo E7500 by factors of about 2.1 (2.1), 1.8 (1.8), 2.2 (2.2), 1.7 (1.7), 5.6 (5.6), 2.7 (2.6), and 6.6 (6.3); on the Core i5 M520 by factors of about 3.7 (3.6), 2.2 (2.6), 3.8 (4), 1.9 (1.9), 7.9 (7.8), 2.9 (3), and 9.6 (10.7); and on the Xeon E5410 by factors of about 2.6 (2.3), 3.2 (2.8), 4.7 (3), 1.5 (1.5), 5.4 (4.9), 5 (5.1), and 12.1 (7), respectively.
Keywords :
cache storage; microprocessor chips; multi-threading; multiprocessing systems; Core 2 Duo E7500; Core i5 M520; Intel multicore processors; Xeon E5410; block QR decomposition; cache memories; data level parallelism; hyper threading technology; loaded data reuse; multithreading computation; streaming SIMD extensions; thread level parallelism; Engines; Lithography; Matrix decomposition; TV; DLP; Householder transformation; QR decomposition; SIMD; TLP; matrix blocking; memory hierarchy; multi-core processors; multithreading; performance evaluation;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Computer Engineering Conference (ICENCO), 2011 Seventh International
Conference_Location :
Giza
Print_ISBN :
978-1-4673-0730-7
Type :
conf
DOI :
10.1109/ICENCO.2011.6153928
Filename :
6153928