Paper Title
Low Precision Floating-point Arithmetic for High Performance FPGA-based CNN Acceleration
Paper Authors
Paper Abstract
Low precision data representation is important for reducing the storage size and memory traffic of convolutional neural networks (CNNs). Yet, existing methods have two major limitations: (1) they require re-training to maintain accuracy for deep CNNs, and (2) they need 16-bit floating-point or 8-bit fixed-point representations to achieve good accuracy. In this paper, we propose a low precision (8-bit) floating-point (LPFP) quantization method for FPGA-based acceleration that overcomes both limitations. Without any re-training, LPFP finds an optimal 8-bit data representation with negligible top-1/top-5 accuracy loss (within 0.5%/0.3%, respectively, in our experiments, and significantly better than existing methods for deep CNNs). Furthermore, we implement one 8-bit LPFP multiplication with one 4-bit multiply-adder (MAC) and one 3-bit adder, and can therefore implement four 8-bit LPFP multiplications in one DSP slice of the Xilinx Kintex 7 family (KC705 in this paper), whereas one DSP slice can implement only two 8-bit fixed-point multiplications. Inference experiments on six typical CNNs show that, on average, we improve throughput by 64.5x over an Intel i9 CPU and by 1.5x over existing FPGA accelerators. In particular, for VGG16 and YOLO, compared to six recent FPGA accelerators, we improve average throughput by 3.5x and 27.5x, and average throughput per DSP by 4.1x and 5x, respectively. To the best of our knowledge, this is the first in-depth study to simplify one multiplication for CNN inference to one 4-bit MAC and to implement four multiplications within one DSP slice while maintaining comparable accuracy without any re-training.
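The abstract's central algorithmic claim is that a good 8-bit floating-point format can be found per network without re-training. Below is a minimal sketch of what such a search could look like, assuming the 8-bit word splits into one sign bit, `exp_bits` exponent bits, and `man_bits` mantissa bits, and the split is chosen by minimizing quantization error over sampled weights; the function names, the conventional exponent bias, and the mean-squared-error metric are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def lpfp_quantize(x, exp_bits, man_bits, bias):
    """Round values to a sign/exponent/mantissa low-precision float format.

    1 sign bit + exp_bits + man_bits == 8. Hypothetical helper illustrating
    an 8-bit LPFP representation; not the paper's exact encoding.
    """
    sign = np.sign(x)
    mag = np.abs(x)
    # Per-value exponent, clamped to the representable range of the field.
    exp = np.floor(np.log2(np.maximum(mag, 1e-38)))
    exp = np.clip(exp, -bias, (1 << exp_bits) - 1 - bias)
    # Round the mantissa to man_bits fractional bits and saturate
    # (sketch-level saturation, counting the implicit leading bit).
    scale = 2.0 ** (exp - man_bits)
    mant = np.round(mag / scale)
    mant = np.minimum(mant, (1 << (man_bits + 1)) - 1)
    return sign * mant * scale

def find_best_split(weights):
    """Try every sign/exponent/mantissa split of the 8-bit word and pick
    the one with the smallest mean squared quantization error."""
    best = None
    for exp_bits in range(1, 7):          # the 1 sign bit is fixed
        man_bits = 7 - exp_bits
        bias = (1 << (exp_bits - 1)) - 1  # conventional bias; an assumption
        q = lpfp_quantize(weights, exp_bits, man_bits, bias)
        err = np.mean((q - weights) ** 2)
        if best is None or err < best[0]:
            best = (err, exp_bits, man_bits)
    return best
```

Calling `find_best_split` on a layer's weights would report, for instance, whether a 4-bit-exponent/3-bit-mantissa split beats a 3-bit/4-bit one for that network; the abstract states that one such 8-bit representation keeps top-1/top-5 loss within 0.5%/0.3% with no re-training.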
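The four-multiplications-per-DSP claim rests on the fact that an LPFP product needs only a narrow mantissa multiplication (the exponents are summed separately by the 3-bit adder), so several narrow products can share one wide hardware multiplier. The sketch below shows the underlying operand-packing arithmetic in plain integer math, assuming 4-bit unsigned mantissas and a hypothetical 12-bit field separation; the paper's actual mapping onto a Kintex 7 DSP slice reaches four multiplications per slice and is more involved than this two-product illustration.

```python
def packed_mul(a0, a1, b, shift=12):
    """Compute a0*b and a1*b with ONE wide multiplication by packing
    a0 and a1 into disjoint bit fields of a single operand.

    For 4-bit unsigned a0, a1, b, each partial product fits in 8 bits,
    so a 12-bit shift keeps the two products from overlapping.
    Illustrative only; not the paper's exact DSP slice mapping.
    """
    packed = (a0 << shift) | a1       # one wide operand holding both inputs
    wide = packed * b                 # single hardware multiplication
    p1 = wide & ((1 << shift) - 1)    # low field:  a1 * b
    p0 = wide >> shift                # high field: a0 * b
    return p0, p1

# Worst case for 4-bit operands: 15 * 15 = 225 in each field.
assert packed_mul(15, 15, 15) == (225, 225)
```

Because each 4-bit-by-4-bit product fits in 8 bits, the two results occupy disjoint bit fields of the single wide product and can be sliced back out with no extra multiplier hardware.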