Paper Title
Optimal Clipping and Magnitude-aware Differentiation for Improved Quantization-aware Training
Paper Authors
Paper Abstract
Data clipping is crucial in reducing noise in quantization operations and improving the achievable accuracy of quantization-aware training (QAT). Current practices rely on heuristics to set clipping threshold scalars and cannot be shown to be optimal. We propose Optimally Clipped Tensors And Vectors (OCTAV), a recursive algorithm to determine MSE-optimal clipping scalars. Derived from the fast Newton-Raphson method, OCTAV finds optimal clipping scalars on the fly, for every tensor, at every iteration of the QAT routine. Thus, the QAT algorithm is formulated with provably minimum quantization noise at each step. In addition, we reveal limitations in common gradient estimation techniques in QAT and propose magnitude-aware differentiation as a remedy to further improve accuracy. Experimentally, OCTAV-enabled QAT achieves state-of-the-art accuracy on multiple tasks. These include training-from-scratch and retraining ResNets and MobileNets on ImageNet, and SQuAD fine-tuning using BERT models, where OCTAV-enabled QAT consistently preserves accuracy at low precision (4 to 6 bits). Our results require no modifications to the baseline training recipe, except for the insertion of quantization operations where appropriate.
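The abstract's two contributions can be made concrete with a short sketch. The first function below illustrates an OCTAV-style Newton-Raphson recursion: each step re-partitions the tensor around the current clipping scalar s, balancing clipping noise from elements above s against uniform quantization noise from elements below it, whose MSE for B-bit quantization over [-s, s] scales as (4^-B / 3) * s^2. The second function gives one plausible reading of magnitude-aware differentiation: inside the clipping range the straight-through gradient of 1 is kept, while a clipped element, whose quantizer output s * sign(x) equals (s / |x|) * x, receives the gradient s / |x| rather than the zero a clipped straight-through estimator would assign. This is a minimal illustration under those assumptions, not the authors' implementation; all function names, the initialization, and the iteration count are ours.

```python
import numpy as np

def octav_clipping_scalar(x, num_bits, num_iters=20):
    """Sketch of an OCTAV-style Newton-Raphson recursion for an
    MSE-optimal clipping scalar s (illustrative, not the authors' code).

    Each step splits |x| around the current s: clipped elements
    (|x| > s) contribute clipping noise; unclipped elements contribute
    quantization noise with variance ~ (4^-B / 3) * s^2.
    """
    abs_x = np.abs(np.asarray(x, dtype=np.float64)).ravel()
    quant_coeff = 4.0 ** (-num_bits) / 3.0
    s = abs_x.mean()  # positive initialization; assumed, not from the paper
    for _ in range(num_iters):
        clipped = abs_x > s
        s = abs_x[clipped].sum() / (
            quant_coeff * np.count_nonzero(~clipped) + np.count_nonzero(clipped)
        )
    return s

def magnitude_aware_grad(x, s):
    """Hypothetical magnitude-aware derivative of the clipped quantizer
    w.r.t. its input: 1 inside the clipping range (straight-through),
    s / |x| for clipped elements (since s * sign(x) == (s / |x|) * x).
    """
    x = np.asarray(x, dtype=np.float64)
    return np.where(np.abs(x) <= s, 1.0, s / np.maximum(np.abs(x), 1e-12))

# Example: a 4-bit clipping scalar for a heavy-tailed weight tensor.
weights = np.random.laplace(scale=0.05, size=10_000)
s_opt = octav_clipping_scalar(weights, num_bits=4)
grads = magnitude_aware_grad(weights, s_opt)
```

Because each recursion step needs only per-tensor reductions (one sum and two counts), it is consistent with the abstract's claim that optimal clipping scalars can be computed on the fly, for every tensor, at every QAT iteration.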