Paper Title
Efficient Attention-free Video Shift Transformers
Paper Authors
Paper Abstract
This paper tackles the problem of efficient video recognition. In this area, video transformers have recently dominated the efficiency (top-1 accuracy vs FLOPs) spectrum. At the same time, there have been some attempts in the image domain which challenge the necessity of the self-attention operation within the transformer architecture, advocating the use of simpler approaches for token mixing. However, there are no results yet for the case of video recognition, where the self-attention operator has a significantly higher impact (compared to the case of images) on efficiency. To address this gap, in this paper, we make the following contributions: (a) we construct a highly efficient & accurate attention-free block based on the shift operator, coined the Affine-Shift block, specifically designed to approximate as closely as possible the operations in the MHSA block of a Transformer layer. Based on our Affine-Shift block, we construct our Affine-Shift Transformer and show that it already outperforms all existing shift/MLP-based architectures for ImageNet classification. (b) We extend our formulation to the video domain to construct the Video Affine-Shift Transformer (VAST), the very first purely attention-free shift-based video transformer. (c) We show that VAST significantly outperforms recent state-of-the-art transformers on the most popular action recognition benchmarks for the case of models with low computational and memory footprint. Code will be made available.
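To make the core idea concrete, the sketch below illustrates a generic shift-based token-mixing step of the kind the abstract alludes to: channels are partitioned into groups, each group is shifted along a spatial axis so that neighboring tokens exchange information without any attention computation, and a per-channel affine transform re-weights the result. This is only an illustrative sketch under our own assumptions (function name, channel grouping, and shift pattern are hypothetical), not the paper's actual Affine-Shift block.

```python
import numpy as np

def affine_shift_mix(x, shift=1):
    """Hypothetical shift-based token mixing for a (T, H, W, C) video feature map.

    Channels are split into five groups: four are shifted by `shift` along
    +/-H and +/-W, one is left in place. A per-channel affine (scale, bias)
    then re-weights the mixed features; both would be learnable in practice.
    No attention is computed, so the mixing itself is FLOP-free.
    """
    T, H, W, C = x.shape
    g = C // 5
    out = x.copy()
    # Group 0: shift down along H; group 1: shift up along H.
    out[:, :, :, 0 * g:1 * g] = np.roll(x[:, :, :, 0 * g:1 * g], shift, axis=1)
    out[:, :, :, 1 * g:2 * g] = np.roll(x[:, :, :, 1 * g:2 * g], -shift, axis=1)
    # Group 2: shift right along W; group 3: shift left along W.
    out[:, :, :, 2 * g:3 * g] = np.roll(x[:, :, :, 2 * g:3 * g], shift, axis=2)
    out[:, :, :, 3 * g:4 * g] = np.roll(x[:, :, :, 3 * g:4 * g], -shift, axis=2)
    # Remaining channels stay in place; apply the per-channel affine.
    scale = np.ones(C)   # learnable scale (identity here for illustration)
    bias = np.zeros(C)   # learnable bias (zero here for illustration)
    return out * scale + bias
```

A video extension in the spirit of VAST would additionally shift some channel groups along the temporal axis (axis 0 here), so that tokens also mix across frames at no extra FLOP cost.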