Paper Title
Integration of a systolic array based hardware accelerator into a DNN operator auto-tuning framework
Paper Authors
Paper Abstract
The deployment of neural networks on heterogeneous SoCs coupled with custom accelerators is a challenging task because of the lack of end-to-end software tools provided for these systems. Moreover, the low-level schedules and mapping strategies already provided by accelerator developers for typical tensor operations are not necessarily optimal for each particular use case. This is why frameworks that automatically test the performance of generated code on a specific hardware configuration are of special interest. In this work, the integration between the code generation framework TVM and the systolic array-based accelerator Gemmini is presented. A generic schedule to offload the GEneral Matrix Multiply (GEMM) tensor operation onto Gemmini is detailed, and its suitability is tested by executing the AutoTVM tuning process on it. Our generated code achieves a peak throughput of 46 giga-operations per second (GOPS) at a 100 MHz clock on a Xilinx ZCU102 FPGA, outperforming previous work. Furthermore, the code generated by this integration was able to surpass the default hand-tuned schedules provided by the Gemmini developers on real-world workloads.
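To make the described workflow concrete, below is a minimal sketch of how a tunable GEMM template is declared with TVM's AutoTVM API so that the tuner can search tiling factors, analogous to the schedule parameters explored when offloading to Gemmini. This is an illustration under stated assumptions, not the paper's actual Gemmini schedule: the template name gemm_sketch and the particular knobs are hypothetical, and a real Gemmini backend would additionally tensorize the inner tiles onto the accelerator's systolic array.

```python
# Minimal AutoTVM GEMM template sketch (illustrative; not the paper's schedule).
import tvm
from tvm import te, autotvm


@autotvm.template("gemm_sketch")  # hypothetical template name
def gemm(M, N, K):
    # int8 inputs with int32 accumulation, matching a typical Gemmini datapath.
    A = te.placeholder((M, K), name="A", dtype="int8")
    B = te.placeholder((K, N), name="B", dtype="int8")
    k = te.reduce_axis((0, K), name="k")
    C = te.compute(
        (M, N),
        lambda i, j: te.sum(A[i, k].astype("int32") * B[k, j].astype("int32"), axis=k),
        name="C",
    )
    s = te.create_schedule(C.op)

    # Expose tiling factors as tunable knobs; for a Gemmini offload these
    # would correspond to tile sizes constrained by the systolic array and
    # scratchpad dimensions.
    cfg = autotvm.get_config()
    i, j = s[C].op.axis
    cfg.define_split("tile_i", i, num_outputs=2)
    cfg.define_split("tile_j", j, num_outputs=2)
    io, ii = cfg["tile_i"].apply(s, C, i)
    jo, ji = cfg["tile_j"].apply(s, C, j)
    s[C].reorder(io, jo, ii, ji)
    return s, [A, B, C]
```

A tuning task could then be created with autotvm.task.create("gemm_sketch", args=(M, N, K), target=...) and explored with one of AutoTVM's tuners, which measures each candidate schedule on the target hardware, mirroring the on-device search the abstract describes.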