Performance Assessment of OpenMP Compilers Targeting NVIDIA V100 GPUs

Title

Performance Assessment of OpenMP Compilers Targeting NVIDIA V100 GPUs

Authors

Davis, Joshua Hoke; Daley, Christopher; Pophale, Swaroop; Huber, Thomas; Chandrasekaran, Sunita; Wright, Nicholas J.

Abstract

Heterogeneous systems are becoming increasingly prevalent. In order to exploit the rich compute resources of such systems, robust programming models are needed for application developers to seamlessly migrate legacy code from today's systems to tomorrow's. Over more than a decade, directives have been established as one of the promising paths to tackle programming challenges on emerging systems. This work focuses on applying and demonstrating OpenMP offloading directives on five proxy applications. We observe that performance varies widely from one compiler to another; a crucial aspect of our work is reporting best practices to application developers who use OpenMP offloading compilers. While some issues can be worked around by the developer, others must be reported to the compiler vendors. By restructuring OpenMP offloading directives, we obtain an 18x speedup for the su3 proxy application on NERSC's Cori system when using the Clang compiler, and a 15.7x speedup in the laplace mini-app by switching max reductions to add reductions when using the Cray-llvm compiler on Cori.
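To make the "max reductions versus add reductions" remark concrete, here is a minimal sketch of the two reduction forms on a combined OpenMP target construct. It assumes a Jacobi-style 2-D Laplace sweep on an n x n row-major grid; the laplace mini-app's actual code is not shown here, and the function names `jacobi_sweep_max` and `jacobi_sweep_sum` are hypothetical. Note that the two functions compute different convergence measures (largest change versus summed change), so swapping one for the other is only valid if the convergence test is adjusted accordingly. That adjustment is an assumption of this sketch, not a description of the authors' change.

```c
/*
 * Hypothetical illustration: one Jacobi sweep of a 2-D Laplace solver on an
 * n x n row-major grid, offloaded with a combined OpenMP target construct.
 * Only interior points are updated; boundary entries of `out` keep their
 * host values because `out` is mapped tofrom.
 */

/* Convergence measure = largest pointwise change, via reduction(max:...). */
double jacobi_sweep_max(int n, const double *restrict in, double *restrict out)
{
    double err = 0.0;
    #pragma omp target teams distribute parallel for collapse(2) \
        reduction(max:err) map(to: in[0:n*n]) map(tofrom: out[0:n*n]) map(tofrom: err)
    for (int i = 1; i < n - 1; i++)
        for (int j = 1; j < n - 1; j++) {
            double v = 0.25 * (in[(i - 1) * n + j] + in[(i + 1) * n + j] +
                               in[i * n + j - 1] + in[i * n + j + 1]);
            out[i * n + j] = v;
            double d = v - in[i * n + j];
            if (d < 0.0) d = -d;   /* absolute change, no <math.h> needed */
            if (d > err) err = d;  /* max-reduction update */
        }
    return err;
}

/* Convergence measure = summed pointwise change, via reduction(+:...). */
double jacobi_sweep_sum(int n, const double *restrict in, double *restrict out)
{
    double err = 0.0;
    #pragma omp target teams distribute parallel for collapse(2) \
        reduction(+:err) map(to: in[0:n*n]) map(tofrom: out[0:n*n]) map(tofrom: err)
    for (int i = 1; i < n - 1; i++)
        for (int j = 1; j < n - 1; j++) {
            double v = 0.25 * (in[(i - 1) * n + j] + in[(i + 1) * n + j] +
                               in[i * n + j - 1] + in[i * n + j + 1]);
            out[i * n + j] = v;
            double d = v - in[i * n + j];
            if (d < 0.0) d = -d;   /* absolute change */
            err += d;              /* add-reduction update */
        }
    return err;
}
```

The explicit `map(tofrom: err)` guards against older (OpenMP 4.5-era) behavior, where a scalar on a `target` construct defaults to firstprivate and its reduced value would not return to the host. Any speedup from the swap depends on how a given compiler implements each reduction, which this sketch does not attempt to reproduce.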