Paper Title

XpulpNN: Enabling Energy Efficient and Flexible Inference of Quantized Neural Network on RISC-V based IoT End Nodes

Paper Authors

Angelo Garofalo, Giuseppe Tagliavini, Francesco Conti, Luca Benini, Davide Rossi

Paper Abstract

This work introduces lightweight extensions to the RISC-V ISA to boost the efficiency of heavily Quantized Neural Network (QNN) inference on microcontroller-class cores. By extending the ISA with nibble (4-bit) and crumb (2-bit) SIMD instructions, we are able to show near-linear speedup with respect to higher precision integer computation on the key kernels for QNN computation. Also, we propose a custom execution paradigm for SIMD sum-of-dot-product operations, which consists of fusing a dot product with a load operation, with an up to 1.64x peak MAC/cycle improvement compared to a standard execution scenario. To further push the efficiency, we integrate the RISC-V extended core in a parallel cluster of 8 processors, with near-linear improvement with respect to a single core architecture. To evaluate the proposed extensions, we fully implement the cluster of processors in GF22FDX technology. QNN convolution kernels on a parallel cluster implementing the proposed extension run 6x and 8x faster when considering 4- and 2-bit data operands, respectively, compared to a baseline processing cluster only supporting 8-bit SIMD instructions. With a peak of 2.22 TOPs/s/W, the proposed solution achieves efficiency levels comparable with dedicated DNN inference accelerators, and up to three orders of magnitude better than state-of-the-art ARM Cortex-M based microcontroller systems such as the low-end STM32L4 MCU and the high-end STM32H7 MCU.
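The abstract refers to nibble (4-bit) and crumb (2-bit) SIMD sum-of-dot-product instructions. As a rough illustration of the arithmetic such an instruction performs, the plain-C sketch below emulates the 4-bit case; the names sext4 and sdotp_nibble, the lane ordering, and the packing format are assumptions made for illustration and are not the paper's actual intrinsics or instruction encoding.

/* Hypothetical sketch (plain C, not the XpulpNN ISA or its intrinsics):
   it emulates what a nibble (4-bit) SIMD sum-of-dot-product instruction
   computes, i.e. accumulating the eight products of the signed 4-bit
   lanes packed into each 32-bit operand. */
#include <stdint.h>
#include <stdio.h>

/* Sign-extend one 4-bit lane to a signed 32-bit value. */
static inline int32_t sext4(uint32_t nib)
{
    return (nib & 0x8) ? (int32_t)nib - 16 : (int32_t)nib;
}

/* Emulated nibble sum-of-dot-product: eight 4-bit x 4-bit products added to acc. */
static inline int32_t sdotp_nibble(uint32_t a, uint32_t b, int32_t acc)
{
    for (int lane = 0; lane < 8; lane++) {
        acc += sext4((a >> (4 * lane)) & 0xF) * sext4((b >> (4 * lane)) & 0xF);
    }
    return acc;
}

int main(void)
{
    /* Lanes of a (low to high): 1, 2, 3, 4, -1, -2, -3, -4; b is all ones. */
    uint32_t a = 0xCDEF4321;
    uint32_t b = 0x11111111;
    printf("acc = %d\n", sdotp_nibble(a, b, 0));  /* 1+2+3+4-1-2-3-4 = 0 */
    return 0;
}

Packing eight 4-bit operands (or sixteen 2-bit operands) per 32-bit word is what yields the near-linear speedup over 8-bit SIMD claimed in the abstract: a single instruction processes two or four times as many MACs as its 8-bit counterpart.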
