Paper Title
Sense: Model Hardware Co-design for Accelerating Sparse CNN on Systolic Array
Paper Authors
Paper Abstract
Sparsity is an intrinsic property of convolutional neural networks (CNNs) and is worth exploiting in CNN accelerators, but the extra processing it requires brings hardware overhead, leaving many architectures with only marginal gains. Meanwhile, systolic arrays have become increasingly competitive for CNN acceleration thanks to their high spatiotemporal locality and low hardware overhead. However, the irregularity of sparsity induces imbalanced workloads under the rigid systolic dataflow, degrading performance. This paper therefore proposes a systolic-array-based architecture, called Sense, for sparse CNN acceleration through model-hardware co-design, achieving a large performance improvement. To balance input feature map (IFM) and weight loads across the Processing Element (PE) array, we apply channel clustering to gather IFMs with similar sparsity for array computation, and co-design a load-balancing weight pruning method that keeps the sparsity ratio of each kernel at a fixed value with little accuracy loss, improving PE utilization and overall performance. Additionally, Adaptive Dataflow Configuration determines the computing strategy based on the storage ratio of IFMs and weights, reducing DRAM accesses by 1.17x-1.8x compared with Swallow and further lowering system energy consumption. The whole design is implemented on a Zynq ZCU102 at 200 MHz and runs at 471, 34, 53, and 191 images/s on AlexNet, VGG-16, ResNet-50, and GoogleNet, respectively. Compared with the sparse systolic-array-based accelerators Swallow, FESA, and SPOTS, Sense achieves 1x-2.25x, 1.95x-2.5x, and 1.17x-2.37x performance improvements on these CNNs, respectively, with reasonable overhead.
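To make the two model-side ideas in the abstract concrete, the sketch below shows, in plain Python/NumPy, one way that per-kernel load-balanced magnitude pruning and sparsity-based channel grouping could be realized. It is only an illustration under assumed parameters (the 0.75 target sparsity, four channel groups, and the function names are hypothetical), not the paper's actual algorithm or hardware mapping.

```python
import numpy as np

def prune_kernel_balanced(weights, target_sparsity=0.75):
    """Prune every output-channel kernel to the same sparsity ratio by
    zeroing its smallest-magnitude weights (illustrative, not the paper's
    exact method). weights: (out_ch, in_ch, kh, kw)."""
    pruned = weights.copy()
    for oc in range(pruned.shape[0]):
        kernel = pruned[oc].ravel()
        n_prune = int(round(target_sparsity * kernel.size))
        if n_prune == 0:
            continue
        # Zero the n_prune weights with the smallest absolute values.
        drop = np.argsort(np.abs(kernel))[:n_prune]
        kernel[drop] = 0.0
        pruned[oc] = kernel.reshape(pruned.shape[1:])
    return pruned

def cluster_channels_by_sparsity(ifm, n_groups=4):
    """Group IFM channels with similar zero ratios so each group presents
    an approximately balanced workload to the PE array.
    ifm: (channels, h, w); returns a list of channel-index arrays."""
    zero_ratio = np.array([(ch == 0).mean() for ch in ifm])
    order = np.argsort(zero_ratio)            # sort channels by sparsity
    return np.array_split(order, n_groups)    # contiguous groups have similar sparsity

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.standard_normal((8, 16, 3, 3)).astype(np.float32)
    w_pruned = prune_kernel_balanced(w, target_sparsity=0.75)
    print("per-kernel sparsity:",
          np.round([(k == 0).mean() for k in w_pruned], 2))  # all ~0.75

    ifm = rng.standard_normal((32, 14, 14)).astype(np.float32)
    ifm[ifm < 0] = 0.0                                        # ReLU-style zeros
    groups = cluster_channels_by_sparsity(ifm, n_groups=4)
    print("channels per group:", [len(g) for g in groups])
```

In this sketch, balancing comes from the fact that every kernel ends up with the same fraction of zeros, so no PE column receives a disproportionately dense kernel; the actual co-designed pruning and dataflow in Sense are described in the paper itself.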