Paper Title
TRT-ViT: TensorRT-oriented Vision Transformer
Paper Authors
Paper Abstract
We revisit existing excellent Transformers from the perspective of practical application. Most of them are not even as efficient as the basic ResNet series and deviate from realistic deployment scenarios. This may be because the current criteria for measuring computational efficiency, such as FLOPs or parameter counts, are one-sided, sub-optimal, and hardware-insensitive. Thus, this paper directly treats the TensorRT latency on specific hardware as an efficiency metric, which provides more comprehensive feedback involving computational capacity, memory cost, and bandwidth. Based on a series of controlled experiments, this work derives four practical guidelines for TensorRT-oriented and deployment-friendly network design, e.g., early CNN and late Transformer at stage-level, early Transformer and late CNN at block-level. Accordingly, a family of TensorRT-oriented Transformers is presented, abbreviated as TRT-ViT. Extensive experiments demonstrate that TRT-ViT significantly outperforms existing ConvNets and vision Transformers with respect to the latency/accuracy trade-off across diverse visual tasks, e.g., image classification, object detection, and semantic segmentation. For example, at 82.7% ImageNet-1k top-1 accuracy, TRT-ViT is 2.7$\times$ faster than CSWin and 2.0$\times$ faster than Twins. On the MS-COCO object detection task, TRT-ViT achieves comparable performance with Twins, while the inference speed is increased by 2.8$\times$.
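To make the two design guidelines from the abstract concrete, below is a minimal PyTorch sketch of a hybrid backbone: convolutional early stages feeding a Transformer-containing late stage (stage-level guideline), and a mixed block whose attention sub-layer precedes its convolutional sub-layer (block-level guideline). This is an illustration under assumptions, not the paper's actual TRT-ViT architecture; the `MixedBlock`/`HybridBackbone` names and all layer widths, depths, and strides are invented for the example.

```python
# Minimal sketch of "early CNN, late Transformer" (stage-level) and
# "early Transformer, late CNN" (block-level). Not the paper's blocks;
# every size below is illustrative.
import torch
import torch.nn as nn


class MixedBlock(nn.Module):
    """Block-level guideline: attention sub-layer first, CNN sub-layer last."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.conv = nn.Sequential(  # depthwise conv refines the attended map
            nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim),
            nn.BatchNorm2d(dim),
            nn.GELU(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, C, H, W)
        b, c, h, w = x.shape
        seq = x.flatten(2).transpose(1, 2)  # (B, H*W, C) tokens for attention
        y = self.norm(seq)
        seq = seq + self.attn(y, y, y, need_weights=False)[0]  # pre-norm residual
        x = seq.transpose(1, 2).reshape(b, c, h, w)  # back to a feature map
        return x + self.conv(x)  # late CNN sub-layer


class HybridBackbone(nn.Module):
    """Stage-level guideline: CNN early stages, Transformer-bearing late stage."""

    def __init__(self):
        super().__init__()
        self.stage1 = nn.Sequential(  # early stage: plain convs on large maps
            nn.Conv2d(3, 64, kernel_size=4, stride=4),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
        )
        self.stage2 = nn.Sequential(
            nn.Conv2d(64, 128, kernel_size=2, stride=2),
            nn.BatchNorm2d(128),
            nn.ReLU(inplace=True),
        )
        self.stage3 = nn.Sequential(  # late stage: mixed blocks on small maps
            nn.Conv2d(128, 256, kernel_size=2, stride=2),
            MixedBlock(256),
            MixedBlock(256),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.stage3(self.stage2(self.stage1(x)))


if __name__ == "__main__":
    out = HybridBackbone()(torch.randn(1, 3, 224, 224))
    print(out.shape)  # torch.Size([1, 256, 14, 14])
```

Keeping attention in the late, low-resolution stage is what makes this layout deployment-friendly: attention cost grows quadratically with the number of tokens, while early convolutions handle the large feature maps cheaply.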
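Likewise, the efficiency metric the abstract argues for, TensorRT latency on specific hardware, can be approximated by exporting a model to ONNX and timing it with TensorRT's `trtexec` benchmarking tool. The snippet below is a hedged sketch of that workflow, not the paper's measurement protocol; the stand-in model, the ONNX opset version, and the use of `--fp16` are assumptions, and it requires a TensorRT installation that puts `trtexec` on the PATH.

```python
# Hedged sketch: time a model with TensorRT via ONNX export + trtexec.
# The model, opset, and flags are illustrative, not the paper's protocol.
import subprocess

import torch
import torch.nn as nn

model = nn.Sequential(  # stand-in model; any ONNX-exportable network works
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(16, 10),
).eval()

# Export a fixed-shape ONNX graph for the benchmark.
torch.onnx.export(
    model, torch.randn(1, 3, 224, 224), "model.onnx", opset_version=13
)

# trtexec ships with TensorRT; it builds an engine from the ONNX file and
# reports latency statistics (here with FP16 kernels enabled).
subprocess.run(["trtexec", "--onnx=model.onnx", "--fp16"], check=True)
```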