Paper Title

All-in-One: A Highly Representative DNN Pruning Framework for Edge Devices with Dynamic Power Management

Authors

Yifan Gong, Zheng Zhan, Pu Zhao, Yushu Wu, Chao Wu, Caiwen Ding, Weiwen Jiang, Minghai Qin, Yanzhi Wang

Abstract

During the deployment of deep neural networks (DNNs) on edge devices, many research efforts are devoted to coping with limited hardware resources. However, little attention is paid to the influence of dynamic power management. Because edge devices typically run on a limited battery energy budget (rather than the nearly unlimited energy supply of servers or workstations), their dynamic power management often changes the execution frequency, as in the widely used dynamic voltage and frequency scaling (DVFS) technique. This leads to highly unstable inference speed, especially for computation-intensive DNN models, which can harm user experience and waste hardware resources. We first identify this problem and then propose All-in-One, a highly representative pruning framework that works with dynamic power management using DVFS. The framework uses only one set of model weights and soft masks (together with other auxiliary parameters of negligible storage) to represent multiple models of various pruning ratios. By re-configuring the model to the pruning ratio corresponding to a specific execution frequency (and voltage), we are able to achieve stable inference speed, i.e., keep the difference in speed performance under various execution frequencies as small as possible. Our experiments demonstrate that our method not only achieves high accuracy for multiple models of different pruning ratios, but also reduces the variance of their inference latency across frequencies, with the minimal memory consumption of only one model and one soft mask.
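To make the core idea concrete, below is a minimal sketch (not the authors' implementation) of how a single weight tensor and a single soft mask could yield sub-models at several pruning ratios, with the DVFS execution frequency selecting which ratio to apply at run time. All names and values here (`derive_binary_mask`, `FREQ_TO_RATIO`, the specific frequencies) are hypothetical assumptions for illustration only.

```python
# Hypothetical sketch of the "one weight set + one soft mask -> many pruning ratios" idea.
# This is NOT the paper's code; it only illustrates the mechanism described in the abstract.
import torch

def derive_binary_mask(soft_mask: torch.Tensor, pruning_ratio: float) -> torch.Tensor:
    """Keep the (1 - pruning_ratio) fraction of entries with the largest soft-mask scores."""
    num_keep = max(1, int(soft_mask.numel() * (1.0 - pruning_ratio)))
    threshold = torch.topk(soft_mask.flatten(), num_keep).values.min()
    return (soft_mask >= threshold).float()

# Hypothetical mapping from a DVFS frequency level (MHz) to a pruning ratio:
# lower frequency -> more aggressive pruning, so inference latency stays roughly constant.
FREQ_TO_RATIO = {2265: 0.0, 1836: 0.3, 1420: 0.5, 1036: 0.7}

def configure_for_frequency(weight: torch.Tensor, soft_mask: torch.Tensor,
                            freq_mhz: int) -> torch.Tensor:
    """Re-configure the stored weights to the pruning ratio matching the current frequency."""
    mask = derive_binary_mask(soft_mask, FREQ_TO_RATIO[freq_mhz])
    return weight * mask  # effective weights used at this frequency level

# One stored weight/soft-mask pair serves every frequency level.
w = torch.randn(64, 64)
m = torch.rand(64, 64)  # one shared soft mask
for f in FREQ_TO_RATIO:
    pruned_w = configure_for_frequency(w, m, f)
    print(f, float((pruned_w != 0).float().mean()))  # density shrinks as frequency drops
```

The sketch assumes a simple magnitude-style thresholding of the soft mask; the point it illustrates is only the storage argument from the abstract: every pruning ratio is derived on the fly from the same weight tensor and the same soft mask, so switching ratios with the frequency costs no extra model storage.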
