Paper Title
Modeling GPU Dynamic Parallelism for Self Similar Density Workloads
Paper Authors
Paper Abstract
Dynamic Parallelism (DP) is a runtime feature of the GPU programming model that allows GPU threads to execute additional GPU kernels, recursively. Apart from making the programming of parallel hierarchical patterns easier, DP can also speed up problems that exhibit a heterogeneous data layout by focusing, through a subdivision process, the finite GPU resources on the sub-regions that exhibit more parallelism. However, performing an optimal subdivision process is not trivial, as several parameters play an important role in the final performance of DP. Moreover, the current programming abstraction for DP also introduces an overhead that can penalize the final performance. In this work we present a subdivision cost model for problems that exhibit self-similar density (SSD) workloads (such as fractals), in order to understand which parameters provide the fastest subdivision approach. We also introduce a new subdivision implementation, named \textit{Adaptive Serial Kernels} (ASK), as a lower-overhead alternative to CUDA's Dynamic Parallelism. Using the cost model on the Mandelbrot Set as a case study shows that the optimal scheme is to start with an initial subdivision of $g=[2,16]$, then keep subdividing in regions of $r=2,4$, and stop when regions reach a size of $B \sim 32$. The experimental results agree with the theoretical parameters, confirming the usability of the cost model. In terms of performance, the proposed ASK approach runs up to $\sim 60\%$ faster than Dynamic Parallelism on the Mandelbrot set, and up to $12\times$ faster than a basic exhaustive implementation, whereas DP is up to $7.5\times$ faster.
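For context, the following is a minimal CUDA sketch of the kind of recursive subdivision that Dynamic Parallelism enables for the Mandelbrot set, parameterized by the initial subdivision $g$, the recursion ratio $r$, and the stopping region size $B$ mentioned in the abstract. It is an illustration under assumptions, not the authors' code: the identifiers (mandelbrotRegion, subdivide, MAX_DWELL), the pixel-to-plane mapping, and the constants are hypothetical, and the sketch omits the paper's cost model, the density test that makes the subdivision adaptive, and the ASK implementation. Device-side kernel launches require compiling with relocatable device code (nvcc -rdc=true) on a GPU supporting Dynamic Parallelism.

#define G 8          // initial g x g subdivision; the cost model suggests g in [2,16]
#define R 4          // subdivision ratio at deeper levels; r = 2 or 4
#define B 32         // stop subdividing once a region reaches B x B
#define MAX_DWELL 512

// One thread per pixel of the s x s region anchored at pixel (x0, y0),
// computing the escape-time (dwell) value exhaustively.
__global__ void mandelbrotRegion(int *dwells, int w, int x0, int y0, int s) {
    int x = x0 + blockIdx.x * blockDim.x + threadIdx.x;
    int y = y0 + blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= x0 + s || y >= y0 + s || x >= w || y >= w) return;
    float cx = -2.0f + 3.0f * x / w;   // map pixel to the complex plane (assumed window)
    float cy = -1.5f + 3.0f * y / w;
    float zx = 0.0f, zy = 0.0f;
    int dwell = 0;
    while (dwell < MAX_DWELL && zx * zx + zy * zy < 4.0f) {
        float t = zx * zx - zy * zy + cx;
        zy = 2.0f * zx * zy + cy;
        zx = t;
        ++dwell;
    }
    dwells[y * w + x] = dwell;
}

// One thread per child region: recurse from the GPU while regions are large,
// then switch to exhaustive per-pixel work once they reach B x B.
__global__ void subdivide(int *dwells, int w, int x0, int y0, int s, int ratio) {
    int child = blockIdx.x * blockDim.x + threadIdx.x;
    if (child >= ratio * ratio) return;
    int cs = s / ratio;                      // child region size
    int cx0 = x0 + (child % ratio) * cs;
    int cy0 = y0 + (child / ratio) * cs;
    if (cs <= B) {
        dim3 bt(B, B), bg((cs + B - 1) / B, (cs + B - 1) / B);
        mandelbrotRegion<<<bg, bt>>>(dwells, w, cx0, cy0, cs);
    } else {
        subdivide<<<1, R * R>>>(dwells, w, cx0, cy0, cs, R);
    }
}

// Host side: a single initial launch over the whole w x w image.
//   subdivide<<<1, G * G>>>(dwells_d, w, 0, 0, w, G);

The ASK approach proposed in the paper replaces these device-side launches with a different mechanism; the sketch above only illustrates the Dynamic Parallelism baseline that the cost model parameterizes.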