An Efficient FPGA Accelerator for Point Cloud

Paper Title

An Efficient FPGA Accelerator for Point Cloud

Authors

Wang, Zilun; Mao, Wendong; Yang, Peixiang; Wang, Zhongfeng; Lin, Jun

Abstract

Deep learning-based point cloud processing plays an important role in various vision tasks, such as autonomous driving, virtual reality (VR), and augmented reality (AR). The submanifold sparse convolutional network (SSCN) has been widely used for the point cloud due to its unique advantages in terms of visual results. However, existing convolutional neural network accelerators suffer from non-trivial performance degradation when employed to accelerate SSCN because of the extreme and unstructured sparsity, and the complex computational dependency between the sparsity of the central activation and the neighborhood ones. In this paper, we propose a high performance FPGA-based accelerator for SSCN. Firstly, we develop a zero removing strategy to remove the coarse-grained redundant regions, thus significantly improving computational efficiency. Secondly, we propose a concise encoding scheme to obtain the matching information for efficient point-wise multiplications. Thirdly, we develop a sparse data matching unit and a computing core based on the proposed encoding scheme, which can convert the irregular sparse operations into regular multiply-accumulate operations. Finally, an efficient hardware architecture for the submanifold sparse convolutional layer is developed and implemented on the Xilinx ZCU102 field-programmable gate array board, where the 3D submanifold sparse U-Net is taken as the benchmark. The experimental results demonstrate that our design drastically improves computational efficiency, and can dramatically improve the power efficiency by 51 times compared to GPU.
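To make the abstract's terminology concrete, the following is a minimal software sketch of a single-channel 3D submanifold sparse convolution. It is not the paper's hardware design or its encoding scheme, which the abstract does not detail. The function names (`build_matches`, `submanifold_conv3d`), the dictionary-based voxel storage, and the flat match list are all assumptions made for illustration. The sketch shows the two properties the abstract relies on: outputs are produced only at sites that are active in the input, and once the (output site, neighbor, weight) matches are known, the irregular sparse work reduces to a regular multiply-accumulate loop.

```python
from itertools import product


def build_matches(active, kernel_size=3):
    """List every (output_site, neighbor_site, offset) triple where both
    the central site and the neighbor are active.

    Hypothetical helper: a software stand-in for 'matching information',
    not the encoding scheme proposed in the paper.
    """
    r = kernel_size // 2
    offsets = list(product(range(-r, r + 1), repeat=3))
    matches = []
    for (x, y, z) in active:  # submanifold rule: only active sites produce output
        for (dx, dy, dz) in offsets:
            neighbor = (x + dx, y + dy, z + dz)
            if neighbor in active:  # inactive (zero) neighbors are skipped entirely
                matches.append(((x, y, z), neighbor, (dx, dy, dz)))
    return matches


def submanifold_conv3d(active, weights, kernel_size=3):
    """Single-channel submanifold sparse convolution.

    active:  dict mapping (x, y, z) -> float feature of an active voxel
    weights: dict mapping (dx, dy, dz) -> float kernel weight
    Returns a dict with exactly the same keys as `active`.
    """
    out = {site: 0.0 for site in active}
    # With the matches precomputed, the computation is a plain MAC loop.
    for site, neighbor, offset in build_matches(active, kernel_size):
        out[site] += weights[offset] * active[neighbor]
    return out


if __name__ == "__main__":
    voxels = {(0, 0, 0): 1.0, (1, 0, 0): 2.0, (5, 5, 5): 3.0}
    ones = {off: 1.0 for off in product(range(-1, 2), repeat=3)}
    print(submanifold_conv3d(voxels, ones))
```

Because the output keeps the same active set as the input, sparsity does not dilate from layer to layer, which is the property that distinguishes submanifold sparse convolution from an ordinary convolution applied to sparse data.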
