DocumentCode
3345077
Title
Efficient implementation of QR decomposition on Intel multi-core processors
Author
Soliman, Mostafa I.
Author_Institution
Electr. Eng. Dept., South Valley Univ., Aswan, Egypt
fYear
2011
fDate
27-28 Dec. 2011
Firstpage
25
Lastpage
30
Abstract
This paper shows how to make the QR decomposition algorithm run faster on Intel multi-core processors by exploiting explicit parallelism and the memory hierarchy. Streaming SIMD extensions and multithreaded computation on multiple cores are used to exploit data-level parallelism (DLP) and thread-level parallelism (TLP), respectively. In addition, the memory hierarchy is exploited by performing the QR computation on blocks of data, which reduces the impact of memory latency by reusing loaded data in the cache memories. On Core 2 Duo E7500 with two cores (2 physical/2 logical processors), Core i5 M520 with two cores supporting Hyper-Threading technology (2 physical/4 logical processors), and Xeon E5410 with four cores (4 physical/4 logical processors), the average speedups of the multithreaded SIMD implementation of block QR decomposition relative to the non-parallel execution, on matrices from 1000×1000 up to 3000×3000 in steps of 100, are about 6.6, 9.6, and 11.3, respectively. On reasonably large matrices of size 2000×2000 (4000×4000), our experimental results show that the use of Intel streaming SIMD extensions, multithreading, SIMD multithreading, matrix blocking, blocking SIMD, blocking multithreading, and blocking SIMD multithreading speeds up QR decomposition on Core 2 Duo E7500 by factors of about 2.1 (2.1), 1.8 (1.8), 2.2 (2.2), 1.7 (1.7), 5.6 (5.6), 2.7 (2.6), and 6.6 (6.3); on Core i5 M520 by factors of about 3.7 (3.6), 2.2 (2.6), 3.8 (4), 1.9 (1.9), 7.9 (7.8), 2.9 (3), and 9.6 (10.7); and on Xeon E5410 by factors of about 2.6 (2.3), 3.2 (2.8), 4.7 (3), 1.5 (1.5), 5.4 (4.9), 5 (5.1), and 12.1 (7), respectively.
Keywords
cache storage; microprocessor chips; multi-threading; multiprocessing systems; Core 2 Duo E7500; Core i5 M520; Intel multicore processors; Xeon E5410; block QR decomposition; cache memories; data level parallelism; hyper threading technology; loaded data reuse; multithreading computation; streaming SIMD extensions; thread level parallelism; Engines; Lithography; Matrix decomposition; TV; DLP; Householder transformation; QR decomposition; SIMD; TLP; matrix blocking; memory hierarchy; multi-core processors; multithreading; performance evaluation;
fLanguage
English
Publisher
IEEE
Conference_Titel
Computer Engineering Conference (ICENCO), 2011 Seventh International
Conference_Location
Giza
Print_ISBN
978-1-4673-0730-7
Type
conf
DOI
10.1109/ICENCO.2011.6153928
Filename
6153928