Paper Title


LearningGroup: A Real-Time Sparse Training on FPGA via Learnable Weight Grouping for Multi-Agent Reinforcement Learning

Authors

Je Yang, JaeUk Kim, Joo-Young Kim

Abstract


Multi-agent reinforcement learning (MARL) is a powerful technology for building interactive artificial intelligence systems in applications such as multi-robot control and self-driving cars. Unlike supervised models or single-agent reinforcement learning, which actively exploit network pruning, it is unclear how pruning works in MARL, given its cooperative and interactive characteristics.

In this paper, we present a real-time sparse training acceleration system named LearningGroup, which applies network pruning to the training of MARL for the first time, using an algorithm/architecture co-design approach. We create sparsity with a weight grouping algorithm and propose the on-chip sparse data encoding loop (OSEL), which enables fast encoding with an efficient implementation. Based on OSEL's encoding format, LearningGroup performs efficient weight compression and allocates the computation workload to multiple cores, where each core processes multiple sparse rows of the weight matrix simultaneously with vector processing units. As a result, the LearningGroup system reduces the cycle time and memory footprint for sparse data generation by up to 5.72x and 6.81x, respectively. Its FPGA accelerator achieves 257.40-3629.48 GFLOPS throughput and 7.10-100.12 GFLOPS/W energy efficiency across various MARL conditions, which is 7.13x higher throughput and 12.43x better energy efficiency than an Nvidia Titan RTX GPU, thanks to fully on-chip training and the highly optimized dataflow and data format enabled by the FPGA. Most importantly, the accelerator achieves up to a 12.52x speedup for processing sparse data over the dense case, the highest among state-of-the-art sparse training accelerators.
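The abstract does not detail the weight grouping algorithm or OSEL's exact encoding format. As a generic illustration of the two ingredients it names, the sketch below applies group-wise magnitude pruning (keeping the largest-magnitude weight per group, a common grouping scheme and an assumption here, not the paper's algorithm) and then encodes the resulting sparse matrix in a CSR-style format of row pointers, column indices, and nonzero values, which is the kind of compressed representation a sparse-row accelerator core could consume.

```python
import numpy as np

def group_prune(weights, group_size=4, keep_per_group=1):
    """Zero out all but the largest-magnitude weights in each
    contiguous group along a row (illustrative group-wise
    magnitude pruning; not the paper's exact algorithm)."""
    w = weights.copy()
    rows, cols = w.shape
    assert cols % group_size == 0, "row length must divide into groups"
    g = w.reshape(rows, cols // group_size, group_size)
    # argsort by magnitude; the first (group_size - keep) entries
    # in each group are the smallest and get dropped
    order = np.argsort(np.abs(g), axis=2)
    drop = order[:, :, : group_size - keep_per_group]
    np.put_along_axis(g, drop, 0.0, axis=2)
    return g.reshape(rows, cols)

def encode_csr(w):
    """Encode a sparse matrix as (row_ptr, col_idx, vals) in
    CSR form, one common sparse-row format."""
    row_ptr, col_idx, vals = [0], [], []
    for row in w:
        nz = np.nonzero(row)[0]
        col_idx.extend(nz.tolist())
        vals.extend(row[nz].tolist())
        row_ptr.append(len(col_idx))  # running count of nonzeros
    return row_ptr, col_idx, vals

# Example: 2x8 weight matrix, groups of 4, keep 1 per group -> 75% sparsity
w = np.array([[1., 2, 3, 4, 5, 6, 7, 8],
              [8., 7, 6, 5, 4, 3, 2, 1]])
pruned = group_prune(w)
row_ptr, col_idx, vals = encode_csr(pruned)
```

With this compressed form, each accelerator core only needs the slice of `col_idx`/`vals` between two consecutive `row_ptr` entries to process a row, which is what makes distributing sparse rows across cores straightforward.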
