Bandwidth-Optimized Parallel Algorithms for Sparse Matrix-Matrix Multiplication using Propagation Blocking

Paper Title

Bandwidth-Optimized Parallel Algorithms for Sparse Matrix-Matrix Multiplication using Propagation Blocking

Authors

Gu, Zhixiang; Moreira, Jose; Edelsohn, David; Azad, Ariful

Abstract

Sparse matrix-matrix multiplication (SpGEMM) is a widely used kernel in various graph, scientific computing and machine learning algorithms. It is well known that SpGEMM is a memory-bound operation, and its peak performance is expected to be bound by the memory bandwidth. Yet, existing algorithms fail to saturate the memory bandwidth, resulting in suboptimal performance under the Roofline model. In this paper we characterize existing SpGEMM algorithms based on their memory access patterns and develop practical lower and upper bounds for SpGEMM performance. We then develop an SpGEMM algorithm based on outer product matrix multiplication. The newly developed algorithm called PB-SpGEMM saturates memory bandwidth by using the propagation blocking technique and by performing in-cache sorting and merging. For many practical matrices, PB-SpGEMM runs 20%-50% faster than the state-of-the-art heap and hash SpGEMM algorithms on modern multicore processors. Most importantly, PB-SpGEMM attains performance predicted by the Roofline model, and its performance remains stable with respect to matrix size and sparsity.
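
The abstract only names the ingredients of PB-SpGEMM (outer-product multiplication, propagation blocking, sorting and merging), so the following is a minimal Python sketch of that general idea, not the authors' implementation. The function name `pb_spgemm_sketch`, the input layout (`a_cols`, `b_rows`), and the bin-sizing rule are all assumptions made for illustration; the in-cache behaviour the paper relies on cannot be shown in pure Python.

```python
def pb_spgemm_sketch(a_cols, b_rows, n_rows, n_bins=4):
    """Illustrative outer-product SpGEMM with row-range binning.

    a_cols[k] lists the nonzeros of column k of A as (i, a_ik) pairs.
    b_rows[k] lists the nonzeros of row k of B as (j, b_kj) pairs.
    Returns the nonzeros of C = A * B as sorted (i, j, value) triples.
    All names and the binning rule are hypothetical.
    """
    # Each bin owns a contiguous range of output rows (ceiling division).
    rows_per_bin = max(1, -(-n_rows // n_bins))
    bins = [[] for _ in range(n_bins)]

    # Expand: column k of A times row k of B is a rank-1 update.
    # Partial products are appended to the bin that owns row i instead of
    # being scattered straight into C (the "propagation blocking" step).
    for k, col in a_cols.items():
        for i, a in col:
            for j, b in b_rows.get(k, []):
                bins[i // rows_per_bin].append((i, j, a * b))

    # Sort and merge: each bin is processed on its own, so its working
    # set stays small; equal (i, j) keys become adjacent and are summed.
    result = []
    for bucket in bins:
        bucket.sort(key=lambda t: (t[0], t[1]))
        for i, j, v in bucket:
            if result and result[-1][0] == i and result[-1][1] == j:
                result[-1] = (i, j, result[-1][2] + v)
            else:
                result.append((i, j, v))
    return result


# A = [[1, 2], [0, 3]] stored by column, B = [[4, 0], [5, 6]] stored by row.
a_cols = {0: [(0, 1)], 1: [(0, 2), (1, 3)]}
b_rows = {0: [(0, 4)], 1: [(0, 5), (1, 6)]}
c = pb_spgemm_sketch(a_cols, b_rows, n_rows=2, n_bins=2)
```

Because bins cover disjoint, increasing row ranges, concatenating the per-bin results already yields output sorted by row and column. In a parallel setting the bins could be sorted and merged independently, which is presumably where the bandwidth-friendly access pattern comes from, though the abstract does not spell this out.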
