Regional Information Center for Science and Technology - Fast Parallel Cutoff Pair Interactions for Molecular Dynamics on Heterogeneous Systems

Authors:

Wu, Qiang; Yang, Canqun; Tang, Tao; Lu, Kai (all: National University of Defense Technology - School of Computer Science, China)

Abstract:

Heterogeneous systems that combine Central Processing Units (CPUs) and Graphics Processing Units (GPUs) are frequently used to accelerate short-range Molecular Dynamics (MD) simulations. The most time-consuming task in these simulations is computing particle-to-particle interactions, which decrease to zero beyond a certain cutoff distance. To reduce the number of distance checks, previous works tile the interactions using the particles' spatial attributes, but this increases memory accesses and GPU computation and therefore lowers performance. Other studies ignore the spatial attributes and construct an all-versus-all interaction matrix, which scales poorly. This paper presents an improved algorithm. It first bins particles into voxels according to their spatial attributes and then tiles the all-versus-all matrix into voxel-versus-voxel sub-matrices. Only the sub-matrices between neighboring voxels are computed on the GPU. The algorithm therefore reduces the number of distance checks while limiting the additional memory accesses and GPU computation. The paper also adopts a multi-level programming model to implement the algorithm on multiple nodes of Tianhe-1A. By employing (1) a patch design to exploit parallelism across the simulation domain, (2) a communication overlapping method to overlap the communications between CPUs and GPUs, and (3) a dynamic workload balancing method to adjust the workloads among compute nodes, the implementation achieves a speedup of 4.16x on one NVIDIA Tesla M2050 GPU compared to a 2.93 GHz six-core Intel Xeon X5670 CPU. In addition, it runs 2.41x faster on 256 compute nodes of Tianhe-1A (each with two CPUs and one GPU) than on 256 nodes with the GPU excluded.
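The binning and tiling idea in the abstract can be illustrated with a minimal CPU-only Python sketch. It is not the authors' GPU implementation. The function names (`squared_distance`, `bin_into_voxels`, `cutoff_pairs`), the choice of a voxel edge equal to the cutoff, and the non-periodic boundaries are all assumptions made for illustration. The abstract does not specify any of them.

```python
from collections import defaultdict
from itertools import product
import math


def squared_distance(p, q):
    """Squared Euclidean distance between two 3D points."""
    dx, dy, dz = p[0] - q[0], p[1] - q[1], p[2] - q[2]
    return dx * dx + dy * dy + dz * dz


def bin_into_voxels(positions, cutoff):
    """Map each voxel index (ix, iy, iz) to the ids of the particles inside it.

    Assumption: the voxel edge equals the cutoff, so any two particles closer
    than the cutoff sit in the same voxel or in two adjacent voxels.
    """
    voxels = defaultdict(list)
    for pid, (x, y, z) in enumerate(positions):
        key = (math.floor(x / cutoff),
               math.floor(y / cutoff),
               math.floor(z / cutoff))
        voxels[key].append(pid)
    return voxels


def cutoff_pairs(positions, cutoff):
    """Return the set of pairs (i, j) with i < j that lie within the cutoff.

    Instead of scanning the full all-versus-all matrix, only the
    voxel-versus-voxel tiles between neighboring voxels are visited.
    Boundaries are treated as non-periodic (an assumption of this sketch).
    """
    voxels = bin_into_voxels(positions, cutoff)
    cutoff_sq = cutoff * cutoff
    pairs = set()
    for (ix, iy, iz), members in voxels.items():
        # A voxel interacts with itself and its 26 neighbors.
        for dx, dy, dz in product((-1, 0, 1), repeat=3):
            neighbor = voxels.get((ix + dx, iy + dy, iz + dz))
            if neighbor is None:
                continue  # empty voxel: the whole tile is skipped
            # One tile of the interaction matrix: members x neighbor.
            for i in members:
                for j in neighbor:
                    if i < j and squared_distance(positions[i], positions[j]) < cutoff_sq:
                        pairs.add((i, j))
    return pairs


if __name__ == "__main__":
    demo = [(0.0, 0.0, 0.0), (0.5, 0.0, 0.0), (3.0, 0.0, 0.0)]
    print(sorted(cutoff_pairs(demo, 1.0)))
```

In this sketch each tile is a small dense block of distance checks, which is the kind of regular work that maps naturally onto a GPU thread block, while tiles between non-neighboring voxels are never formed at all.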