Paper Title

F8Net: Fixed-Point 8-bit Only Multiplication for Network Quantization

Authors

Qing Jin, Jian Ren, Richard Zhuang, Sumant Hanumante, Zhengang Li, Zhiyu Chen, Yanzhi Wang, Kaiyuan Yang, Sergey Tulyakov

Abstract

Neural network quantization is a promising compression technique to reduce memory footprint and save energy consumption, potentially leading to real-time inference. However, there is a performance gap between quantized and full-precision models. To reduce it, existing quantization approaches require high-precision INT32 or full-precision multiplication during inference for scaling or dequantization. This introduces a noticeable cost in terms of memory, speed, and required energy. To tackle these issues, we present F8Net, a novel quantization framework consisting of only fixed-point 8-bit multiplication. To derive our method, we first discuss the advantages of fixed-point multiplication with different formats of fixed-point numbers and study the statistical behavior of the associated fixed-point numbers. Second, based on the statistical and algorithmic analysis, we apply different fixed-point formats for weights and activations of different layers. We introduce a novel algorithm to automatically determine the right format for each layer during training. Third, we analyze a previous quantization algorithm, parameterized clipping activation (PACT), and reformulate it using fixed-point arithmetic. Finally, we unify the recently proposed method for quantization fine-tuning and our fixed-point approach to show the potential of our method. We verify F8Net on ImageNet for MobileNet V1/V2 and ResNet18/50. Our approach achieves comparable or better performance when compared not only to existing quantization techniques with INT32 multiplication or floating-point arithmetic, but also to the full-precision counterparts, achieving state-of-the-art performance.
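The core mechanism described in the abstract, fixed-point 8-bit multiplication with per-tensor fractional lengths instead of INT32 or floating-point rescaling, can be illustrated with a short sketch. This is not the authors' implementation: the function names (quantize_fixed_point, fixed_point_multiply), the hand-picked fractional lengths, and the rounding details are illustrative assumptions, whereas F8Net determines each layer's format automatically during training.

```python
# Minimal sketch of fixed-point 8-bit multiplication (illustrative, not F8Net's code).
# A real value is stored as a signed 8-bit code q with fractional length FL,
# i.e. value ≈ q / 2**FL, so multiplying two such values needs only an integer
# multiply and a bit shift -- no INT32 scale factor or float dequantization.
import numpy as np

def quantize_fixed_point(x, fl, bits=8):
    """Round x to a signed fixed-point code with `fl` fractional bits."""
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    return np.clip(np.round(x * (2 ** fl)), qmin, qmax).astype(np.int32)

def fixed_point_multiply(qa, fl_a, qb, fl_b, fl_out, bits=8):
    """Multiply two fixed-point codes; renormalize with a rounding bit shift."""
    prod = np.int32(qa) * np.int32(qb)          # 8-bit x 8-bit product fits in 16 bits
    shift = fl_a + fl_b - fl_out                # fractional bits to drop
    assert shift > 0, "sketch assumes fl_out < fl_a + fl_b"
    q_out = np.right_shift(prod + (1 << (shift - 1)), shift)  # round-to-nearest shift
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    return np.clip(q_out, qmin, qmax)

# Example: 0.625 (FL=5) times 1.5 (FL=4), with the output kept at FL=5.
a = quantize_fixed_point(0.625, fl=5)   # code 20, i.e. 20 / 2**5
b = quantize_fixed_point(1.5, fl=4)     # code 24, i.e. 24 / 2**4
out = fixed_point_multiply(a, 5, b, 4, fl_out=5)
print(out / 2 ** 5)                     # ~0.9375, recovered without floating-point rescaling
```

The choice of fractional length trades range against resolution, which is why the paper assigns different formats to the weights and activations of different layers based on their statistics.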
