Paper Title

A scalable and efficient convolutional neural network accelerator using HLS for a System on Chip design

Paper Authors

Bjerge, Kim; Schougaard, Jonathan Horsted; Larsen, Daniel Ejnar

Paper Abstract

This paper presents a configurable Convolutional Neural Network Accelerator (CNNA) for a System on Chip (SoC) design. The goal was to accelerate inference of different deep learning networks on an embedded SoC platform. The presented CNNA has a scalable architecture which uses High-Level Synthesis (HLS) and SystemC for the hardware accelerator. It can accelerate any Convolutional Neural Network (CNN) exported from Python and supports a combination of convolutional, max-pooling, and fully connected layers. A training method with fixed-point quantized weights is proposed and presented in the paper. The CNNA is template-based, enabling it to scale to different targets of the Xilinx Zynq platform. This approach enables design space exploration, making it possible to evaluate several configurations of the CNNA during C- and RTL-simulation and to fit it to the desired platform and model. The solution was tested with the CNN VGG16 on a Xilinx Ultra96 board using PYNQ. Training with the auto-scaled fixed-point Q2.14 format achieved an accuracy close to that of a similar floating-point model. The accelerator performed inference in 2.0 seconds with an average power consumption of 2.63 W, which corresponds to a power efficiency of 6.0 GOPS/W.
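For readers unfamiliar with the notation, the short sketch below illustrates what the Q2.14 fixed-point format means in practice (2 integer bits including sign, 14 fractional bits, range [-2, 2)) and double-checks the reported power-efficiency figure. This is not the paper's code: the function name, the use of NumPy, and the ~30.7 GOP operation count assumed for a VGG16 inference are illustrative assumptions, and the paper's auto-scaling (selecting the fixed-point format to fit the data) is omitted here.

```python
import numpy as np

FRAC_BITS = 14
SCALE = 1 << FRAC_BITS              # 2**14 = 16384 quantization steps per unit
Q_MIN = -2.0                        # most negative Q2.14 value
Q_MAX = 2.0 - 1.0 / SCALE           # most positive Q2.14 value

def quantize_q2_14(x):
    """Round a float array to the nearest representable Q2.14 value.
    Illustrative helper, not from the paper."""
    x = np.clip(x, Q_MIN, Q_MAX)
    return np.round(x * SCALE) / SCALE

# For in-range values the quantization error is at most half an LSB.
w = np.random.uniform(-1.5, 1.5, size=1000)
wq = quantize_q2_14(w)
assert np.max(np.abs(w - wq)) <= 0.5 / SCALE

# Sanity check of the reported 6.0 GOPS/W, assuming VGG16's commonly
# cited ~30.7 GOPs per inference (an assumption, not stated in the abstract).
ops = 30.7e9            # operations per VGG16 inference (assumed)
latency_s = 2.0         # inference time reported in the paper
power_w = 2.63          # average power reported in the paper
print(f"{ops / latency_s / power_w / 1e9:.1f} GOPS/W")  # -> 5.8
```

Under these assumptions the arithmetic yields about 5.8 GOPS/W, consistent with the 6.0 GOPS/W reported in the abstract once rounding and the exact operation count used by the authors are accounted for.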
