NEON: Enabling Efficient Support for Nonlinear Operations in Resistive RAM-based Neural Network Accelerators

Paper title

NEON: Enabling Efficient Support for Nonlinear Operations in Resistive RAM-based Neural Network Accelerators

Authors

Aditya Manglik, Minesh Patel, Haiyu Mao, Behzad Salami, Jisung Park, Lois Orosa, Onur Mutlu

Abstract

Resistive Random-Access Memory (RRAM) is well-suited to accelerate neural network (NN) workloads as RRAM-based Processing-in-Memory (PIM) architectures natively support highly-parallel multiply-accumulate (MAC) operations that form the backbone of most NN workloads. Unfortunately, NN workloads such as transformers require support for non-MAC operations (e.g., softmax) that RRAM cannot provide natively. Consequently, state-of-the-art works either integrate additional digital logic circuits to support the non-MAC operations or offload the non-MAC operations to CPU/GPU, resulting in significant performance and energy efficiency overheads due to data movement. In this work, we propose NEON, a novel compiler optimization to enable the end-to-end execution of the NN workload in RRAM. The key idea of NEON is to transform each non-MAC operation into a lightweight yet highly-accurate neural network. Utilizing neural networks to approximate the non-MAC operations provides two advantages: 1) We can exploit the key strength of RRAM, i.e., highly-parallel MAC operation, to flexibly and efficiently execute non-MAC operations in memory. 2) We can simplify RRAM's microarchitecture by eliminating the additional digital logic circuits while reducing the data movement overheads. Acceleration of the non-MAC operations in memory enables NEON to achieve a 2.28x speedup compared to an idealized digital logic-based RRAM. We analyze the trade-offs associated with the transformation and demonstrate feasible use cases for NEON across different substrates.
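The key idea in the abstract, replacing a non-MAC operation with a small neural network, can be illustrated with a minimal sketch. It is not NEON's actual transformation: the abstract does not state the network architecture, the fitting procedure, or the activation function, so everything below is an assumption made for illustration. The sketch approximates exp(x), the core of softmax, on the interval [-8, 0] with a one-hidden-layer network whose only non-MAC step is a ReLU (assumed here to be cheap to support). The first-layer weights are fixed and the output layer is solved by least squares rather than trained by gradient descent. The names `fit_exp_network`, `approx_exp`, and `approx_softmax` are hypothetical. The max-subtraction, clipping, and final division in `approx_softmax` run as ordinary NumPy operations, since the abstract does not say how NEON handles them.

```python
import numpy as np


def relu(x):
    # Assumed to be the only nonlinearity available alongside MAC operations.
    return np.maximum(x, 0.0)


def fit_exp_network(lo=-8.0, hi=0.0, hidden=32, samples=2001):
    """Fit a one-hidden-layer ReLU network to exp(x) on [lo, hi].

    The hidden layer places one ReLU "hinge" at each evenly spaced knot, so
    the network represents a piecewise-linear function. Only the output
    layer is fitted, using linear least squares.
    """
    knots = np.linspace(lo, hi, hidden, endpoint=False)
    w1 = np.ones((1, hidden))      # first-layer weights (all ones)
    b1 = -knots                    # first-layer biases shift each hinge
    x = np.linspace(lo, hi, samples).reshape(-1, 1)
    h = relu(x @ w1 + b1)          # hidden activations, shape (samples, hidden)
    design = np.hstack([h, np.ones((samples, 1))])  # extra column for output bias
    coef, *_ = np.linalg.lstsq(design, np.exp(x), rcond=None)
    return {"w1": w1, "b1": b1, "w2": coef[:-1], "b2": coef[-1],
            "lo": lo, "hi": hi}


def approx_exp(x, net):
    """Evaluate the fitted network: two matrix products and one ReLU."""
    x = np.clip(np.asarray(x, dtype=float), net["lo"], net["hi"]).reshape(-1, 1)
    h = relu(x @ net["w1"] + net["b1"])
    return (h @ net["w2"] + net["b2"]).ravel()


def approx_softmax(z, net):
    """Softmax over a 1-D vector using the approximated exponential."""
    z = np.asarray(z, dtype=float)
    e = approx_exp(z - z.max(), net)   # shifted inputs are <= 0
    return e / e.sum()


if __name__ == "__main__":
    net = fit_exp_network()
    xs = np.linspace(-8.0, 0.0, 500)
    print("max |error| for exp:", np.max(np.abs(approx_exp(xs, net) - np.exp(xs))))
    print("approximate softmax:", approx_softmax([1.0, 2.0, 3.0], net))
```

Increasing `hidden` tightens the approximation at the cost of more MAC operations, which is one simple way to see the accuracy versus cost trade-off that the abstract says the authors analyze.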
