Title :
Zedwulf: Power-Performance Tradeoffs of a 32-Node Zynq SoC Cluster
Author :
Moorthy, Pradeep ; Kapre, Nachiket
Author_Institution :
Sch. of Comput. Eng., Nanyang Technol. Univ., Singapore, Singapore
Abstract :
Commodity SoCs with hybrid architectures that combine CPUs with programmable FPGA fabric, such as the Xilinx Zynq SoC, have become competitive, energy-efficient platforms for addressing irregular parallelism in graph problems. In this paper, we prototype a 32-node cluster built from these Zynq SoC chips to accelerate communication-bound, sparse graph-oriented applications such as neural network simulations. We develop specialized MPI routines for irregular accelerator-to-accelerator communication of small-message traffic. We use the ARM processor to handle the MPI stack while offloading compute-intensive calculations to the FPGA. For graphs with 32M nodes and 32M edges, Zedwulf delivers the highest throughput in our study, 94 MTEPS (Million Traversed Edges Per Second), outperforming the x86 multi-threaded platforms by 1.2× to 1.4×. For this experiment, Zedwulf operates at an efficiency of 0.49 MTEPS/W when using ARM+FPGA, which is 1.2× better than using ARMv7 CPUs alone and within 8% of the Intel Core i7-4770k platform.
Keywords :
application program interfaces; field programmable gate arrays; message passing; power aware computing; system-on-chip; ARM processor; MPI routines; Zedwulf; Zynq SoC cluster; communication-bound sparse graph-oriented applications; message passing interface; neural network simulations; power-performance tradeoff; programmable FPGA fabric; Bandwidth; Computational modeling; Neurons; Random access memory; Throughput
Conference_Title :
Field-Programmable Custom Computing Machines (FCCM), 2015 IEEE 23rd Annual International Symposium on
Conference_Location :
Vancouver, BC
DOI :
10.1109/FCCM.2015.37