hipBone: A C++ Version of the NekBone Benchmark

Paper Title

HipBone: A performance-portable GPU-accelerated C++ version of the NekBone benchmark

Authors

Noel Chalmers, Abhishek Mishra, Damon McDougall, Tim Warburton

Abstract

We present hipBone, an open source performance-portable proxy application for the Nek5000 (and NekRS) CFD applications. HipBone is a fully GPU-accelerated C++ implementation of the original NekBone CPU proxy application with several novel algorithmic and implementation improvements which optimize its performance on modern fine-grain parallel GPU accelerators. Our optimizations include a conversion to store the degrees of freedom of the problem in assembled form in order to reduce the amount of data moved during the main iteration and a portable implementation of the main Poisson operator kernel. We demonstrate near-roofline performance of the operator kernel on three different modern GPU accelerators from two different vendors. We present a novel algorithm for splitting the application of the Poisson operator on GPUs which aggressively hides MPI communication required for both halo exchange and assembly. Our implementation of nearest-neighbor MPI communication then leverages several different routing algorithms and GPU-Direct RDMA capabilities, when available, which improves scalability of the benchmark. We demonstrate the performance of hipBone on three different clusters housed at Oak Ridge National Laboratory, namely the Summit supercomputer and the Frontier early-access clusters, Spock and Crusher. Our tests demonstrate both portability across different clusters and very good scaling efficiency, especially on large problems.
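
To illustrate what storing the degrees of freedom in assembled form means, here is a minimal sketch. It is not hipBone's code: it assumes a hypothetical 1D mesh of two linear elements that share one node, whereas the real benchmark runs on GPUs with far larger meshes. The names `local_to_global`, `scatter`, `gather`, and `apply_operator` are illustrative only. In element-local storage the shared node is stored twice; in assembled storage each degree of freedom is stored once, so the vectors that move through the main iteration are smaller.

```python
# Hypothetical 1D mesh: 2 linear elements sharing one node (assumption for illustration).
# local_to_global[e][i] is the assembled (global) index of local node i in element e.
local_to_global = [[0, 1], [1, 2]]
num_global = 3

# Stiffness matrix of a 1D linear element of unit length (discrete Poisson operator).
K_ELEM = [[1.0, -1.0], [-1.0, 1.0]]


def scatter(x_global):
    """Copy assembled values out to element-local storage (shared nodes are duplicated)."""
    return [[x_global[g] for g in elem] for elem in local_to_global]


def gather(y_local):
    """Sum element-local contributions back into assembled storage."""
    y_global = [0.0] * num_global
    for elem, values in zip(local_to_global, y_local):
        for g, v in zip(elem, values):
            y_global[g] += v
    return y_global


def apply_operator(x_global):
    """Apply the operator to an assembled vector: scatter, element-wise multiply, gather."""
    x_local = scatter(x_global)
    y_local = [
        [sum(K_ELEM[i][j] * xe[j] for j in range(2)) for i in range(2)]
        for xe in x_local
    ]
    return gather(y_local)


def local_storage_size():
    """Number of values held in element-local (unassembled) storage."""
    return sum(len(elem) for elem in local_to_global)
```

In this toy case the assembled vector holds 3 values while the element-local form holds 4. The gap grows with the number of shared faces, edges, and vertices in a 3D mesh, which is the data-movement saving the abstract refers to.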
