DocumentCode
3775723
Title
Performance optimization of load imbalanced workloads in large scale Dragonfly systems
Author
Bogdan Prisacari;German Rodriguez;Cyriel Minkenberg;Marina Garcia;Enrique Vallejo;Ramon Beivide
Author_Institution
IBM Research - Zurich, Saumerstrasse 4, 8803 Ruschlikon, Switzerland
fYear
2015
fDate
7/1/2015 12:00:00 AM
Firstpage
1
Lastpage
6
Abstract
Dragonfly topologies are one of the most promising interconnect designs for enabling large, potentially exascale compute systems, particularly those envisioned to accommodate workloads that are sensitive to system diameter and end-to-end latency. They are cost-effective designs with a very low diameter and close to optimal performance for workloads that induce a balanced load across the network. However, these benefits are offset by reduced path diversity, which leaves Dragonflies vulnerable to certain adversarial traffic patterns. The performance of such workloads can be significantly improved using indirect routing approaches. However, the indirect routing approach most commonly used today is in turn significantly vulnerable to a subset of these traffic patterns, for reasons that have not, up to now, been entirely understood. In exploring this vulnerability, we provide a theoretical justification, based on inherent properties of the Dragonfly topology, of why performance degrades. Furthermore, we isolate what specifically in the structure of a traffic pattern makes it a worst case in this context, and are thus able to characterize the precise workload subset that will experience poor performance. Building on this understanding of the interaction that causes sub-optimal behavior, we then show how simple changes to either the routing strategy or the process-to-node assignment can bring performance back close to ideal levels. Finally, we not only provide a theoretical justification for our performance models, but also validate them via comprehensive simulation-based studies of systems with up to 16,512 nodes.
Keywords
"Switches","Routing","Topology","Ports (Computers)","Network topology","Degradation","Conferences"
Publisher
ieee
Conference_Titel
High Performance Switching and Routing (HPSR), 2015 IEEE 16th International Conference on
Electronic_ISBN
2325-5609
Type
conf
DOI
10.1109/HPSR.2015.7483107
Filename
7483107