DocumentCode :
656211
Title :
Hierarchical Parallel Matrix Multiplication on Large-Scale Distributed Memory Platforms
Author :
Quintin, Jean-Noel ; Hasanov, Khalid ; Lastovetsky, Alexey
Author_Institution :
Extreme Comput. R&D, Bull, France
fYear :
2013
fDate :
1-4 Oct. 2013
Firstpage :
754
Lastpage :
762
Abstract :
Matrix multiplication is an important computational kernel, both in its own right as a building block of many scientific applications and as a popular representative of other scientific applications. Cannon's algorithm, which dates back to 1969, was the first efficient algorithm for parallel matrix multiplication and provides theoretically optimal communication cost. However, it requires a square number of processors. In the mid-1990s, the SUMMA algorithm was introduced. SUMMA overcomes this shortcoming of Cannon's algorithm, as it can also be used on a nonsquare number of processors. Since then, the number of processors in HPC platforms has increased by two orders of magnitude, making the contribution of communication to the overall execution time more significant. Therefore, state-of-the-art parallel matrix multiplication algorithms should be revisited to reduce the communication cost further. This paper introduces a new parallel matrix multiplication algorithm, Hierarchical SUMMA (HSUMMA), which is a redesign of SUMMA. Our algorithm reduces the communication cost of SUMMA by introducing a two-level virtual hierarchy into the two-dimensional arrangement of processors. Experiments on an IBM Blue Gene/P demonstrate a reduction in communication cost of up to 2.08 times on 2048 cores and up to 5.89 times on 16384 cores.
Keywords :
distributed memory systems; matrix multiplication; parallel algorithms; HSUMMA algorithm; IBM BlueGene/P; communication cost reduction; computation kernel; hierarchical SUMMA algorithm; hierarchical parallel matrix multiplication; large-scale distributed memory platforms; two-dimensional processor arrangement; two-level virtual hierarchy; Algorithm design and analysis; Bandwidth; Clustering algorithms; Computational modeling; Educational institutions; Program processors; Three-dimensional displays; BlueGene; Communication; Exascale; Grid5000; Hierarchical algorithm; Matrix multiplication; Parallel algorithm
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Parallel Processing (ICPP), 2013 42nd International Conference on
Conference_Location :
Lyon
ISSN :
0190-3918
Type :
conf
DOI :
10.1109/ICPP.2013.89
Filename :
6687414