Paper Title
Exploring Stochastic Autoregressive Image Modeling for Visual Representation
Paper Authors
Paper Abstract
Autoregressive language modeling (ALM) has been used successfully for self-supervised pre-training in natural language processing (NLP). However, this paradigm has not achieved results comparable to other self-supervised approaches in computer vision (e.g., contrastive learning, masked image modeling). In this paper, we try to find out why autoregressive modeling does not work well on vision tasks. To tackle this problem, we thoroughly analyze the limitations of visual autoregressive methods and propose a novel stochastic autoregressive image modeling method (named SAIM) built on two simple designs. First, we employ a stochastic permutation strategy to generate effective and robust image context, which is critical for vision tasks. Second, we create a parallel encoder-decoder training process in which the encoder plays a role similar to a standard vision transformer, focusing on learning the whole contextual information, while the decoder predicts the content of the current position, so that the encoder and decoder can reinforce each other. By introducing stochastic prediction and the parallel encoder-decoder, SAIM significantly improves the performance of autoregressive image modeling. Our method achieves the best accuracy (83.9%) with a vanilla ViT-Base model among methods using only ImageNet-1K data. Transfer performance on downstream tasks also shows that our model achieves competitive performance.
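To make the two designs concrete, below is a minimal PyTorch sketch of the idea as we read it from the abstract: patch positions are ordered by a random permutation, a content encoder attends to each patch itself and patches earlier in the permutation, and a position-only decoder query predicts the patch at each step from strictly earlier context. All module names, dimensions, masking details, and the pixel-regression loss here are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToySAIM(nn.Module):
    """Toy stochastic autoregressive image model (illustrative sketch only)."""

    def __init__(self, n_patches=16, patch_dim=48, dim=64, heads=4):
        super().__init__()
        self.embed = nn.Linear(patch_dim, dim)                    # patch -> token
        self.pos = nn.Parameter(torch.zeros(1, n_patches, dim))   # position embeddings / queries
        self.start = nn.Parameter(torch.zeros(1, 1, dim))         # dummy key so step 0 has context
        self.encoder = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.decoder = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.head = nn.Linear(dim, patch_dim)                     # regress raw pixel values

    def forward(self, patches):                                   # patches: (B, N, patch_dim)
        B, N, _ = patches.shape
        perm = torch.randperm(N, device=patches.device)           # stochastic generation order
        rank = torch.empty_like(perm)
        rank[perm] = torch.arange(N, device=perm.device)          # rank[j] = step of patch j

        tokens = self.embed(patches) + self.pos                   # content stream
        keys = torch.cat([self.start.expand(B, -1, -1), tokens], dim=1)  # (B, N+1, dim)
        key_rank = torch.cat([rank.new_full((1,), -1), rank])     # dummy token gets step -1

        # Boolean attention masks: True = query may NOT attend to that key.
        # Encoder (content): each patch sees itself and strictly earlier patches.
        enc_mask = key_rank[None, :] > rank[:, None]              # (N, N+1)
        ctx, _ = self.encoder(tokens, keys, keys, attn_mask=enc_mask)

        # Decoder (query): a position-only query predicts the current patch
        # from strictly earlier context, never from the patch itself.
        dec_keys = torch.cat([keys[:, :1], ctx], dim=1)
        dec_mask = key_rank[None, :] >= rank[:, None]             # (N, N+1)
        q = self.pos.expand(B, -1, -1)
        out, _ = self.decoder(q, dec_keys, dec_keys, attn_mask=dec_mask)

        return F.mse_loss(self.head(out), patches)                # pixel regression loss
```

A quick smoke test of the sketch, with toy "images" of 16 patches, each a flattened 4x4x3 pixel block:

```python
model = ToySAIM()
x = torch.randn(2, 16, 48)
loss = model(x)
loss.backward()
```

Because the permutation is resampled per forward pass, every patch is eventually predicted from many different contexts, which is one plausible reading of how the stochastic strategy yields "effective and robust image context".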