WarpCore: A Library for Fast Hash Tables on GPUs

Paper title

WarpCore: A Library for Fast Hash Tables on GPUs

Authors

Daniel Jünger, Robin Kobus, André Müller, Christian Hundt, Kai Xu, Weiguo Liu, Bertil Schmidt

Abstract

Hash tables are ubiquitous. Properties such as an amortized constant time complexity for insertion and querying as well as a compact memory layout make them versatile associative data structures with manifold applications. The rapidly growing amount of data emerging in many fields has motivated the need for accelerated hash tables designed for modern parallel architectures. In this work, we exploit the fast memory interface of modern GPUs together with a parallel hashing scheme tailored to improve global memory access patterns, to design WarpCore -- a versatile library of hash table data structures. Unique device-sided operations allow for building high-performance data processing pipelines entirely on the GPU. Our implementation achieves up to 1.6 billion inserts and up to 4.3 billion retrievals per second on a single GV100 GPU, thereby outperforming the state-of-the-art solutions cuDPP, SlabHash, and NVIDIA RAPIDS cuDF. This performance advantage becomes even more pronounced for high load factors of over $90\%$. To overcome the memory limitation of a single GPU, we scale our approach over a dense NVLink topology, which gives us close-to-optimal weak scaling on DGX servers. We further show how WarpCore can be used for accelerating a real-world bioinformatics application (metagenomic classification) with speedups of over two orders of magnitude against state-of-the-art CPU-based solutions. WarpCore is written in C++/CUDA-C and is openly available at https://github.com/sleeepyjack/warpcore.
