Paper Title

Minimum Variance Unbiased N:M Sparsity for the Neural Gradients

Authors

Brian Chmiel, Itay Hubara, Ron Banner, Daniel Soudry

Abstract

In deep learning, fine-grained N:M sparsity reduces the data footprint and bandwidth of a General Matrix Multiply (GEMM) by up to 2x, and doubles throughput by skipping the computation of zero values. So far, it has mainly been used to prune weights in order to accelerate the forward and backward phases. We examine how this method can also be used for the neural gradients (i.e., the loss gradients with respect to the intermediate neural-layer outputs). To this end, we first establish a tensor-level optimality criterion. Previous works aimed to minimize the mean-square error (MSE) of each pruned block. We show that while MSE minimization works fine for pruning the weights and activations, it fails catastrophically for the neural gradients. Instead, we show that accurate pruning of the neural gradients requires an unbiased minimum-variance pruning mask. We design such specialized masks and find that, in most cases, 1:2 sparsity is sufficient for training, and 2:4 sparsity is usually enough when it is not. Further, we suggest combining several such methods together in order to potentially speed up training even more.

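To make the idea of an unbiased minimum-variance mask concrete, below is a minimal NumPy sketch of one possible unbiased 1:2 pruning scheme. The function name and the exact rescaling rule are illustrative assumptions, not necessarily the authors' masks: each block of two consecutive values keeps a single entry chosen with probability proportional to its magnitude, rescaled so that the block is preserved in expectation.

```python
import numpy as np

def unbiased_1_2_prune(grad, rng=None):
    """Illustrative unbiased 1:2 pruning of a gradient tensor.

    In each block of two consecutive values, one entry is kept at random
    with probability proportional to its magnitude, and the survivor is
    rescaled so that the expected value of the pruned block equals the
    original block (an unbiased estimator).
    """
    if rng is None:
        rng = np.random.default_rng()
    arr = np.asarray(grad, dtype=np.float64)
    flat = arr.reshape(-1, 2)  # assumes an even number of elements
    a, b = flat[:, 0], flat[:, 1]
    total = np.abs(a) + np.abs(b)
    # Probability of keeping the first element; 0.5 when the whole block is zero.
    p_a = np.divide(np.abs(a), total, out=np.full_like(a, 0.5), where=total > 0)
    keep_a = rng.random(len(flat)) < p_a
    out = np.zeros_like(flat)
    # The survivor is scaled to sign(x) * (|a| + |b|), so E[out] equals the input block.
    out[keep_a, 0] = np.sign(a[keep_a]) * total[keep_a]
    out[~keep_a, 1] = np.sign(b[~keep_a]) * total[~keep_a]
    return out.reshape(arr.shape)
```

Because the survivor is drawn in proportion to its magnitude and rescaled by the block's total magnitude, the estimate is unbiased, in contrast to a greedy MSE-minimizing mask that always keeps the larger entry and therefore introduces bias, which the abstract identifies as the failure mode for neural gradients.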