Paper Title

F8Net: Fixed-Point 8-bit Only Multiplication for Network Quantization

Authors

Qing Jin, Jian Ren, Richard Zhuang, Sumant Hanumante, Zhengang Li, Zhiyu Chen, Yanzhi Wang, Kaiyuan Yang, Sergey Tulyakov

Abstract

Neural network quantization is a promising compression technique to reduce memory footprint and save energy consumption, potentially leading to real-time inference. However, there is a performance gap between quantized and full-precision models. To reduce it, existing quantization approaches require high-precision INT32 or full-precision multiplication during inference for scaling or dequantization. This introduces a noticeable cost in terms of memory, speed, and required energy. To tackle these issues, we present F8Net, a novel quantization framework consisting of only fixed-point 8-bit multiplication. To derive our method, we first discuss the advantages of fixed-point multiplication with different formats of fixed-point numbers and study the statistical behavior of the associated fixed-point numbers. Second, based on the statistical and algorithmic analysis, we apply different fixed-point formats for weights and activations of different layers. We introduce a novel algorithm to automatically determine the right format for each layer during training. Third, we analyze a previous quantization algorithm, parameterized clipping activation (PACT), and reformulate it using fixed-point arithmetic. Finally, we unify the recently proposed method for quantization fine-tuning and our fixed-point approach to show the potential of our method. We verify F8Net on ImageNet for MobileNet V1/V2 and ResNet18/50. Our approach achieves comparable or better performance when compared not only to existing quantization techniques with INT32 multiplication or floating-point arithmetic, but also to the full-precision counterparts, achieving state-of-the-art performance.
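The core mechanism described in the abstract, fixed-point 8-bit multiplication with per-tensor fractional lengths instead of INT32 or floating-point rescaling, can be illustrated with a short sketch. This is not the authors' implementation: the function names (quantize_fixed_point, fixed_point_multiply), the hand-picked fractional lengths, and the rounding details are illustrative assumptions, whereas F8Net determines each layer's format automatically during training.

```python
# Minimal sketch of fixed-point 8-bit multiplication (illustrative, not F8Net's code).
# A real value is stored as a signed 8-bit code q with fractional length FL,
# i.e. value ≈ q / 2**FL, so multiplying two such values needs only an integer
# multiply and a bit shift -- no INT32 scale factor or float dequantization.
import numpy as np

def quantize_fixed_point(x, fl, bits=8):
    """Round x to a signed fixed-point code with `fl` fractional bits."""
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    return np.clip(np.round(x * (2 ** fl)), qmin, qmax).astype(np.int32)

def fixed_point_multiply(qa, fl_a, qb, fl_b, fl_out, bits=8):
    """Multiply two fixed-point codes; renormalize with a rounding bit shift."""
    prod = np.int32(qa) * np.int32(qb)          # 8-bit x 8-bit product fits in 16 bits
    shift = fl_a + fl_b - fl_out                # fractional bits to drop
    assert shift > 0, "sketch assumes fl_out < fl_a + fl_b"
    q_out = np.right_shift(prod + (1 << (shift - 1)), shift)  # round-to-nearest shift
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    return np.clip(q_out, qmin, qmax)

# Example: 0.625 (FL=5) times 1.5 (FL=4), with the output kept at FL=5.
a = quantize_fixed_point(0.625, fl=5)   # code 20, i.e. 20 / 2**5
b = quantize_fixed_point(1.5, fl=4)     # code 24, i.e. 24 / 2**4
out = fixed_point_multiply(a, 5, b, 4, fl_out=5)
print(out / 2 ** 5)                     # ~0.9375, recovered without floating-point rescaling
```

The choice of fractional length trades range against resolution, which is why the paper assigns different formats to the weights and activations of different layers based on their statistics.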
