Paper Title

A scalable and efficient convolutional neural network accelerator using HLS for a System on Chip design

Paper Authors

Bjerge, Kim; Schougaard, Jonathan Horsted; Larsen, Daniel Ejnar

Paper Abstract

This paper presents a configurable Convolutional Neural Network Accelerator (CNNA) for a System on Chip (SoC) design. The goal was to accelerate inference of different deep learning networks on an embedded SoC platform. The presented CNNA has a scalable architecture which uses High-Level Synthesis (HLS) and SystemC for the hardware accelerator. It can accelerate any Convolutional Neural Network (CNN) exported from Python and supports a combination of convolutional, max-pooling, and fully connected layers. A training method with fixed-point quantized weights is proposed and presented in the paper. The CNNA is template-based, enabling it to scale to different targets of the Xilinx Zynq platform. This approach enables design space exploration, making it possible to evaluate several configurations of the CNNA during C- and RTL-simulation and to fit it to the desired platform and model. The solution was tested with the CNN VGG16 on a Xilinx Ultra96 board using PYNQ. Training with the auto-scaled fixed-point Q2.14 format achieved an accuracy close to that of a similar floating-point model. The accelerator performed inference in 2.0 seconds with an average power consumption of 2.63 W, which corresponds to a power efficiency of 6.0 GOPS/W.
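For readers unfamiliar with the notation, the short sketch below illustrates what the Q2.14 fixed-point format means in practice (2 integer bits including sign, 14 fractional bits, range [-2, 2)) and double-checks the reported power-efficiency figure. This is not the paper's code: the function name, the use of NumPy, and the ~30.7 GOP operation count assumed for a VGG16 inference are illustrative assumptions, and the paper's auto-scaling (selecting the fixed-point format to fit the data) is omitted here.

```python
import numpy as np

FRAC_BITS = 14
SCALE = 1 << FRAC_BITS              # 2**14 = 16384 quantization steps per unit
Q_MIN = -2.0                        # most negative Q2.14 value
Q_MAX = 2.0 - 1.0 / SCALE           # most positive Q2.14 value

def quantize_q2_14(x):
    """Round a float array to the nearest representable Q2.14 value.
    Illustrative helper, not from the paper."""
    x = np.clip(x, Q_MIN, Q_MAX)
    return np.round(x * SCALE) / SCALE

# For in-range values the quantization error is at most half an LSB.
w = np.random.uniform(-1.5, 1.5, size=1000)
wq = quantize_q2_14(w)
assert np.max(np.abs(w - wq)) <= 0.5 / SCALE

# Sanity check of the reported 6.0 GOPS/W, assuming VGG16's commonly
# cited ~30.7 GOPs per inference (an assumption, not stated in the abstract).
ops = 30.7e9            # operations per VGG16 inference (assumed)
latency_s = 2.0         # inference time reported in the paper
power_w = 2.63          # average power reported in the paper
print(f"{ops / latency_s / power_w / 1e9:.1f} GOPS/W")  # -> 5.8
```

Under these assumptions the arithmetic yields about 5.8 GOPS/W, consistent with the 6.0 GOPS/W reported in the abstract once rounding and the exact operation count used by the authors are accounted for.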
