Paper Title
Distributed Out-of-Memory SVD on CPU/GPU Architectures
Paper Authors
Paper Abstract
We propose an efficient, distributed, out-of-memory implementation of the truncated singular value decomposition (t-SVD) for heterogeneous (CPU+GPU) high performance computing (HPC) systems. Various implementations of SVD have been proposed, but most only estimate the singular values, since estimating the singular vectors can significantly increase the time and memory complexity of the algorithm. In this work, we propose an implementation of SVD based on the power method, which estimates both the truncated singular values and the corresponding singular vectors. Memory utilization bottlenecks in the power method are typically associated with the computation of the Gram matrix $\mat{A}^T\mat{A}$, which can be significant when $\mat{A}$ is large and dense, or when $\mat{A}$ is super-large and sparse. The proposed implementation is optimized for out-of-memory problems, where the memory required to factorize a given matrix exceeds the available GPU memory. We reduce the memory complexity of computing $\mat{A}^T\mat{A}$ with a batching strategy in which the intermediate factors are computed block by block. We also hide the I/O latency associated with host-to-device (H2D) and device-to-host (D2H) batch copies by overlapping each batch copy with computation using CUDA streams. Furthermore, we use optimized \textit{NCCL}-based communicators to reduce the latency associated with collective communications, both intra-node and inter-node. In addition, sparse and dense matrix multiplications are significantly accelerated with GPU cores (or tensor cores, when available), resulting in an implementation with good scaling. We demonstrate the scalability of our distributed out-of-core SVD algorithm by successfully decomposing a dense matrix of size 1 TB and a sparse matrix of size 128 PB with a sparsity of $10^{-6}$.
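To illustrate the two core ideas of the abstract, the sketch below shows a single-node, plain-NumPy stand-in for the approach: the Gram matrix $\mat{A}^T\mat{A}$ is accumulated block by block (mirroring the batching strategy, where each row block would correspond to one H2D batch copy), and a subspace/power iteration on the Gram matrix recovers the truncated singular values and vectors. This is a minimal sketch under stated assumptions, not the paper's distributed CPU+GPU implementation; the function names (`batched_gram`, `truncated_svd_power`) and parameters (`batch_rows`, `n_iter`) are illustrative and do not come from the source.

```python
import numpy as np

def batched_gram(A, batch_rows=1024):
    """Accumulate A^T A one row block at a time, so only a single block of A
    needs to be resident in (GPU) memory at once. Plain NumPy stand-in for
    the batched H2D-copy-and-multiply loop described in the abstract."""
    m, n = A.shape
    G = np.zeros((n, n), dtype=A.dtype)
    for start in range(0, m, batch_rows):
        block = A[start:start + batch_rows]  # stand-in for one batch copy
        G += block.T @ block
    return G

def truncated_svd_power(A, k, n_iter=100, batch_rows=1024, seed=0):
    """Illustrative power-method t-SVD: subspace iteration on the Gram matrix
    A^T A, a small Rayleigh-Ritz step, then recovery of the left singular
    vectors. Assumes the top k singular values of A are nonzero."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    G = batched_gram(A, batch_rows)                 # n x n, formed block by block
    V = np.linalg.qr(rng.standard_normal((n, k)))[0]
    for _ in range(n_iter):
        V, _ = np.linalg.qr(G @ V)                  # re-orthonormalize each sweep
    # Rayleigh-Ritz on the k x k projected problem.
    T = V.T @ G @ V
    evals, W = np.linalg.eigh(T)
    order = np.argsort(evals)[::-1]                 # descending eigenvalues
    sigma = np.sqrt(np.maximum(evals[order], 0.0))  # singular values of A
    V = V @ W[:, order]                             # right singular vectors
    U = (A @ V) / sigma                             # left singular vectors
    return U, sigma, V
```

In the distributed setting described by the paper, the row blocks of $\mat{A}$ would live on different GPUs, the per-block products would overlap with the batch copies via CUDA streams, and the accumulation of the partial Gram matrices would go through NCCL collectives; the NumPy loop above only captures the algebraic structure. Accuracy can be sanity-checked on a small matrix by comparing `sigma` against the leading values returned by `np.linalg.svd(A, compute_uv=False)`.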