Paper Title
What Makes for Good Tokenizers in Vision Transformer?
Paper Authors
Paper Abstract
The architecture of transformers, which has recently witnessed booming applications in vision tasks, pivots away from the widespread convolutional paradigm. Relying on a tokenization process that splits inputs into multiple tokens, transformers are capable of extracting pairwise relationships among tokens using self-attention. While the tokenizer is the stem building block of transformers, what makes for a good tokenizer has not been well understood in computer vision. In this work, we investigate this uncharted problem from an information trade-off perspective. In addition to unifying and understanding existing structural modifications, our derivation leads to better design strategies for vision tokenizers. The proposed Modulation across Tokens (MoTo) incorporates inter-token modeling capability through normalization. Furthermore, a regularization objective, TokenProp, is embraced in the standard training regime. Through extensive experiments on various transformer architectures, we observe both improved performance and intriguing properties of these two plug-and-play designs with negligible computational overhead. These observations further indicate the importance of the commonly-omitted design of tokenizers in vision transformers.
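The abstract only states that MoTo injects inter-token modeling through normalization, without giving the exact formulation. Below is a minimal sketch of one plausible reading: a LayerNorm-style layer whose statistics are pooled over the token axis rather than the channel axis, so each token's activation is modulated by all other tokens. The class name `CrossTokenModulation`, the `(batch, tokens, dim)` tensor layout, and the learnable affine parameters are assumptions, not the paper's actual module.

```python
import torch
import torch.nn as nn


class CrossTokenModulation(nn.Module):
    """Illustrative sketch (not the paper's MoTo): normalize each feature
    channel using statistics computed across the token dimension, then
    apply a learnable affine transform. Because the mean/variance are
    shared over tokens, every token is coupled with all others."""

    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(dim))   # per-channel scale
        self.beta = nn.Parameter(torch.zeros(dim))   # per-channel shift
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_tokens, dim)
        # Statistics over the token axis (dim=1) make the normalization
        # an inter-token operation, unlike LayerNorm's per-token stats.
        mean = x.mean(dim=1, keepdim=True)
        var = x.var(dim=1, keepdim=True, unbiased=False)
        x_norm = (x - mean) / torch.sqrt(var + self.eps)
        return self.gamma * x_norm + self.beta


# Usage: modulate a batch of 196 patch tokens with 768 channels.
tokens = torch.randn(2, 196, 768)
out = CrossTokenModulation(768)(tokens)
assert out.shape == tokens.shape
```

Since the layer only adds two `dim`-sized parameter vectors and elementwise arithmetic, a design in this spirit would be consistent with the negligible computational overhead the abstract claims for the plug-and-play designs.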