Paper Title
ResT V2: Simpler, Faster and Stronger
Paper Authors
Paper Abstract
This paper proposes ResTv2, a simpler, faster, and stronger multi-scale vision Transformer for visual recognition. ResTv2 simplifies the EMSA structure in ResTv1 (i.e., eliminating the multi-head interaction part) and employs an upsample operation to reconstruct the medium- and high-frequency information lost in the downsampling operation. In addition, we explore different techniques to better apply ResTv2 backbones to downstream tasks. We found that although combining EMSAv2 and window attention can greatly reduce the theoretical matrix-multiply FLOPs, it may significantly decrease the computation density, thus causing lower actual speed. We comprehensively validate ResTv2 on ImageNet classification, COCO detection, and ADE20K semantic segmentation. Experimental results show that the proposed ResTv2 can outperform recent state-of-the-art backbones by a large margin, demonstrating the potential of ResTv2 as a solid backbone. The code and models will be made publicly available at \url{https://github.com/wofmanaf/ResT}
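To make the abstract's description of EMSAv2 concrete, below is a minimal, single-head PyTorch sketch of the general idea: keys and values are spatially downsampled to cut the attention cost, and an upsample branch on the values is added back to recover detail lost by the downsampling. This is an illustrative assumption, not the paper's exact architecture or the code released in the repository; layer choices and hyper-parameters here (`EMSAv2Sketch`, `sr_ratio`, bilinear upsampling) are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EMSAv2Sketch(nn.Module):
    """Single-head sketch of the EMSAv2 idea described in the abstract:
    downsampled keys/values for cheaper attention, plus an upsample branch
    that adds back the information dropped by the downsampling.
    Hyper-parameters and layer choices are assumptions, not the paper's."""

    def __init__(self, dim, sr_ratio=2):
        super().__init__()
        self.scale = dim ** -0.5
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, 2 * dim)
        # depthwise conv downsamples the token map before computing K and V
        self.sr = nn.Conv2d(dim, dim, kernel_size=sr_ratio, stride=sr_ratio, groups=dim)
        self.proj = nn.Linear(dim, dim)
        self.sr_ratio = sr_ratio

    def forward(self, x, H, W):
        # x: (B, N, C) with N = H * W
        B, N, C = x.shape
        q = self.q(x) * self.scale                        # (B, N, C)

        # downsample the feature map, then build K and V from it
        x_ = x.transpose(1, 2).reshape(B, C, H, W)
        x_ = self.sr(x_)                                  # (B, C, H/r, W/r)
        x_ = x_.flatten(2).transpose(1, 2)                # (B, N/r^2, C)
        k, v = self.kv(x_).chunk(2, dim=-1)

        attn = (q @ k.transpose(-2, -1)).softmax(dim=-1)  # (B, N, N/r^2)
        out = attn @ v                                    # (B, N, C)

        # upsample branch: interpolate V back to full resolution and add it
        # in, reconstructing detail removed by the downsampling
        v_map = v.transpose(1, 2).reshape(B, C, H // self.sr_ratio, W // self.sr_ratio)
        v_up = F.interpolate(v_map, size=(H, W), mode='bilinear', align_corners=False)
        out = out + v_up.flatten(2).transpose(1, 2)

        return self.proj(out)

# usage: 196 tokens of dim 64 from a 14x14 feature map
tokens = torch.randn(2, 14 * 14, 64)
print(EMSAv2Sketch(64)(tokens, 14, 14).shape)  # torch.Size([2, 196, 64])
```

For the architecture actually used in the paper, refer to the repository linked above.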