Paper Title

Vision Transformer Computation and Resilience for Dynamic Inference

Paper Authors

Kavya Sreedhar, Jason Clemons, Rangharajan Venkatesan, Stephen W. Keckler, Mark Horowitz

Paper Abstract

State-of-the-art deep learning models for computer vision tasks are based on the transformer architecture and often deployed in real-time applications. In this scenario, the resources available for every inference can vary, so it is useful to be able to dynamically adapt execution to trade accuracy for efficiency. To create dynamic models, we leverage the resilience of vision transformers to pruning and switch between different scaled versions of a model. Surprisingly, we find that most FLOPs are generated by convolutions, not attention. These relative FLOP counts are not a good predictor of GPU performance since GPUs have special optimizations for convolutions. Some models are fairly resilient and their model execution can be adapted without retraining, while all models achieve better accuracy with retraining alternative execution paths. These insights mean that we can leverage CNN accelerators and these alternative execution paths to enable efficient and dynamic vision transformer inference. Our analysis shows that leveraging this type of dynamic execution can lead to saving 28\% of energy with a 1.4\% accuracy drop for SegFormer (63 GFLOPs), with no additional training, and 53\% of energy for ResNet-50 (4 GFLOPs) with a 3.3\% accuracy drop by switching between pretrained Once-For-All models.
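
A short profiling sketch can make the abstract's FLOP-breakdown claim concrete. The following PyTorch example is a minimal illustration, not the paper's code: the `ToyHybrid` model, its layer sizes, and the 512x512 input are assumptions, chosen so that, as in hybrid vision transformers such as SegFormer, a convolutional stem dominates the FLOP count.

```python
# Minimal sketch (not the paper's code): tally FLOPs by operator category
# for a toy conv + attention model, illustrating why convolutions can
# dominate the FLOP count in hybrid vision transformers.
import torch
import torch.nn as nn
from collections import defaultdict

class ToyAttention(nn.Module):
    """Standard self-attention with explicit Linear projections."""
    def __init__(self, dim):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                       # x: (B, N, C)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        attn = (q @ k.transpose(-2, -1)) * x.shape[-1] ** -0.5
        return self.proj(attn.softmax(dim=-1) @ v)

class ToyHybrid(nn.Module):
    """Conv stem (overall stride 16) followed by one attention block."""
    def __init__(self, dim=256):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(3, dim // 2, 3, stride=2, padding=1), nn.GELU(),
            nn.Conv2d(dim // 2, dim, 3, stride=2, padding=1), nn.GELU(),
            nn.Conv2d(dim, dim, 3, stride=4, padding=1),
        )
        self.attn = ToyAttention(dim)

    def forward(self, x):
        x = self.stem(x)                         # (B, C, H/16, W/16)
        return self.attn(x.flatten(2).transpose(1, 2))

flops = defaultdict(int)

def count(module, inputs, output):
    if isinstance(module, nn.Conv2d):
        # FLOPs ~= 2 * output elements * kernel area * in_channels / groups
        k = module.kernel_size[0] * module.kernel_size[1]
        flops["conv"] += 2 * output.numel() * k * module.in_channels // module.groups
    elif isinstance(module, nn.Linear):
        flops["attn_proj"] += 2 * output.numel() * module.in_features
    elif isinstance(module, ToyAttention):
        # The two N x N score matmuls (q @ k^T and attn @ v)
        B, N, C = inputs[0].shape
        flops["attn_matmul"] += 2 * B * 2 * N * N * C

model = ToyHybrid()
for m in model.modules():
    m.register_forward_hook(count)

with torch.no_grad():
    model(torch.randn(1, 3, 512, 512))
print({k: f"{v / 1e9:.2f} GFLOPs" for k, v in flops.items()})
```

For this toy configuration, the convolutions account for roughly 11 GFLOPs against under 2 GFLOPs for the attention projections and score matmuls combined, mirroring the observation above that convolutions, not attention, generate most FLOPs.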
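
The dynamic-execution mechanism itself reduces to a small dispatch decision at inference time. Below is a hedged sketch of budget-driven switching between pretrained scaled variants; the variant names, GFLOP costs, and placeholder `nn.Linear` modules are hypothetical stand-ins for checkpoints such as the Once-For-All subnetworks the abstract mentions.

```python
# Minimal sketch of budget-driven model switching. Variant names and
# GFLOP costs are illustrative assumptions, not measurements.
import torch
import torch.nn as nn

# Pretrained variants of one model family, ordered cheap -> expensive.
# In practice these would be scaled ViT/CNN checkpoints sharing one interface.
VARIANTS = [
    ("small",  1.0, nn.Linear(16, 10)),
    ("medium", 2.5, nn.Linear(16, 10)),
    ("large",  4.0, nn.Linear(16, 10)),
]

def pick_variant(budget_gflops: float):
    """Return the largest (most accurate) variant that fits the budget."""
    affordable = [v for v in VARIANTS if v[1] <= budget_gflops]
    return affordable[-1] if affordable else VARIANTS[0]

x = torch.randn(1, 16)
for budget in (0.5, 2.0, 5.0):               # budget varies per inference
    name, cost, model = pick_variant(budget)
    with torch.no_grad():
        _ = model(x)
    print(f"budget={budget} GFLOPs -> ran '{name}' ({cost} GFLOPs)")
```

Selecting the most expensive variant that fits the current budget serves as a proxy for selecting the most accurate one, which is the accuracy-for-efficiency trade the abstract describes.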
