A Real Time 1280x720 Object Detection Chip With 585MB/s Memory Traffic

Paper title

A Real Time 1280x720 Object Detection Chip With 585MB/s Memory Traffic

Authors

Chang, Kuo-Wei; Shih, Hsu-Tung; Chang, Tian-Sheuan; Tsai, Shang-Hong; Yang, Chih-Chyau; Wu, Chien-Ming; Huang, Chun-Ming

Abstract

Memory bandwidth has become the real-time bottleneck of current deep learning accelerators (DLAs), particularly for high-definition (HD) object detection. Under resource constraints, this paper proposes a low-memory-traffic DLA chip with joint hardware and software optimization. To maximize hardware utilization under limited memory bandwidth, we morph and fuse the object detection model into a group-fusion-ready model to reduce intermediate data access. This reduces YOLOv2's feature memory traffic from 2.9 GB/s to 0.15 GB/s. To support group fusion, the hardware, which is based on our previous DLA, employs a unified buffer with write-masking for simple layer-by-layer processing within a fusion group. Compared to our previous DLA with the same number of PEs, the chip, implemented in a TSMC 40nm process, supports 1280x720@30FPS object detection and consumes 7.9x less external DRAM access energy (327.6 mJ versus 2607 mJ).
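
The traffic saving described in the abstract can be illustrated with a rough sketch. The Python below is not the authors' method: it assumes a simplified model in which layer-by-layer execution moves every feature map through external DRAM, while a fused group only reads its input and writes its final output. The function names and the feature-map sizes are hypothetical; only the figures in the last three lines come from the title and abstract. Weight traffic, tiling overlap, and on-chip buffer limits are ignored.

```python
def layer_by_layer_traffic(feature_sizes):
    """DRAM traffic when each layer reads its input and writes its output.

    feature_sizes[0] is the network input; feature_sizes[i + 1] is the
    output of layer i. Units are arbitrary (e.g. MB per frame).
    """
    return sum(feature_sizes[i] + feature_sizes[i + 1]
               for i in range(len(feature_sizes) - 1))


def fused_traffic(feature_sizes, groups):
    """DRAM traffic when layers inside a group keep intermediates on chip.

    groups is a list of (first_layer, last_layer) pairs, inclusive, that
    together cover all layers. Only each group's input and output touch DRAM.
    """
    return sum(feature_sizes[first] + feature_sizes[last + 1]
               for first, last in groups)


# Hypothetical feature-map sizes for a 4-layer network (illustration only).
sizes = [8, 16, 16, 8, 4]
unfused = layer_by_layer_traffic(sizes)          # 92
fused = fused_traffic(sizes, [(0, 1), (2, 3)])   # 44

# Figures quoted in the title and abstract.
per_frame_budget_mb = 585 / 30      # 19.5 MB of memory traffic per frame
feature_traffic_ratio = 2.9 / 0.15  # roughly 19.3x less feature traffic
energy_ratio = 2607 / 327.6         # roughly 7.96x less DRAM access energy
```

In this toy model, fusing the four layers into two groups roughly halves the traffic; larger groups save more, at the cost of needing more on-chip buffer space for the intermediates.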
