Paper Title


FlexBlock: A Flexible DNN Training Accelerator with Multi-Mode Block Floating Point Support

Paper Authors

Seock-Hwan Noh, Jahyun Koo, Seunghyun Lee, Jongse Park, Jaeha Kung

Paper Abstract


Training deep neural networks (DNNs) is a computationally expensive job, which can take weeks or months even with high-performance GPUs. As a remedy for this challenge, the community has started exploring the use of more efficient data representations in the training process, e.g., block floating point (BFP). However, prior work on BFP-based DNN accelerators relies on a specific BFP representation, making them less versatile. This paper builds upon an algorithmic observation that we can accelerate training by leveraging multiple BFP precisions without compromising the final accuracy. Backed by this algorithmic opportunity, we develop a flexible DNN training accelerator, dubbed FlexBlock, which supports three different BFP precision modes, possibly different among activation, weight, and gradient tensors. While several prior works proposed such multi-precision support for DNN accelerators, not only do they focus solely on inference, but their core utilization is also suboptimal at fixed precisions and for specific layer types when training is considered. Instead, FlexBlock is designed in such a way that high core utilization is achievable for i) various layer types and ii) three BFP precisions by mapping data in a hierarchical manner to its compute units. We evaluate the effectiveness of the FlexBlock architecture using well-known DNNs on the CIFAR, ImageNet, and WMT14 datasets. As a result, training on FlexBlock significantly improves training speed by 1.5~5.3x and energy efficiency by 2.4~7.0x on average compared to other training accelerators, while incurring marginal accuracy loss compared to full-precision training.
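For intuition, below is a minimal Python sketch of the block floating point format the abstract refers to: a block of values shares a single exponent while each element keeps a short signed fixed-point mantissa, so narrower mantissas (e.g., for gradients) trade accuracy for cheaper arithmetic. The function name `bfp_quantize` and its parameters are illustrative assumptions, not FlexBlock's actual implementation.

```python
import numpy as np

def bfp_quantize(block, mantissa_bits=8):
    """Emulate block floating point: one shared exponent per block,
    signed fixed-point mantissas per element (returned as floats for comparison)."""
    block = np.asarray(block, dtype=np.float64)
    max_abs = np.max(np.abs(block))
    if max_abs == 0:
        return np.zeros_like(block)
    # Shared exponent taken from the largest-magnitude element in the block
    shared_exp = int(np.floor(np.log2(max_abs)))
    # Scale chosen so every mantissa fits in [-(2^(m-1)), 2^(m-1) - 1]
    scale = 2.0 ** (shared_exp - mantissa_bits + 2)
    mantissas = np.clip(np.round(block / scale),
                        -(2 ** (mantissa_bits - 1)),
                        2 ** (mantissa_bits - 1) - 1)
    return mantissas * scale

# Hypothetical usage: narrower mantissas give a coarser approximation
x = np.random.randn(16).astype(np.float32)
print(np.max(np.abs(x - bfp_quantize(x, mantissa_bits=8))))  # small error
print(np.max(np.abs(x - bfp_quantize(x, mantissa_bits=4))))  # larger error
```

The sketch only illustrates the data format itself; FlexBlock's contribution is hardware that sustains high core utilization while switching among such precision modes across activation, weight, and gradient tensors.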
