Paper Title
Lightning: Scaling the GPU Programming Model Beyond a Single GPU
Paper Authors
Paper Abstract
The GPU programming model is primarily aimed at the development of applications that run on one GPU. However, this limits the scalability of GPU code to the capabilities of a single GPU in terms of compute power and memory capacity. To scale GPU applications further, a great engineering effort is typically required: work and data must be divided over multiple GPUs by hand, possibly across multiple nodes, and data must be manually spilled from GPU memory to higher-level memories. We present Lightning: a framework that follows the common GPU programming paradigm but enables scaling to large problems with ease. Lightning supports multi-GPU execution of GPU kernels, even across multiple nodes, and seamlessly spills data to higher-level memories (main memory and disk). Existing CUDA kernels can easily be adapted for use in Lightning, with data access annotations on these kernels allowing Lightning to infer their data requirements and the dependencies between subsequent kernel launches. Lightning efficiently distributes the work and data across GPUs and maximizes efficiency by overlapping scheduling, data movement, and kernel execution when possible. We present the design and implementation of Lightning, as well as experimental results on up to 32 GPUs for eight benchmarks and one real-world application. Evaluation shows excellent performance and scalability, such as a speedup of 57.2x over the CPU using Lightning with 16 GPUs over 4 nodes and 80 GB of data, far beyond the memory capacity of one GPU.
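To make the idea of data access annotations concrete, the following is a minimal, purely illustrative sketch (not Lightning's actual API; all names here are hypothetical). It shows how an annotation like "thread i reads a[i-1 : i+2] and writes b[i]" for a 1D stencil kernel lets a framework split the index space across GPUs and then infer which slice of the input array each GPU's partition requires:

```python
# Hypothetical sketch of annotation-driven data-requirement inference.
# This is NOT Lightning's real API; it only illustrates the concept
# described in the abstract: per-thread access annotations let the
# framework derive per-GPU data regions automatically.

def partition(n, num_gpus):
    """Split the index range [0, n) into contiguous per-GPU chunks."""
    chunk = (n + num_gpus - 1) // num_gpus  # ceiling division
    return [(g * chunk, min((g + 1) * chunk, n)) for g in range(num_gpus)]

def infer_reads(work, n, halo=1):
    """Given the annotation 'thread i reads a[i-halo : i+halo+1]',
    derive the input region a partition needs (clamped to bounds)."""
    lo, hi = work
    return (max(0, lo - halo), min(n, hi + halo))

if __name__ == "__main__":
    n, gpus = 1000, 4
    for g, work in enumerate(partition(n, gpus)):
        lo, hi = infer_reads(work, n)
        print(f"GPU {g}: writes b[{work[0]}:{work[1]}], reads a[{lo}:{hi}]")
```

Because the read region of one partition overlaps the write region of its neighbors only at the halo, overlap analysis like this is also what allows a runtime to detect dependencies between subsequent kernel launches and to overlap data movement with execution where partitions are independent.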