Paper Title

CAE v2: Context Autoencoder with CLIP Target

Authors

Xinyu Zhang, Jiahui Chen, Junkun Yuan, Qiang Chen, Jian Wang, Xiaodi Wang, Shumin Han, Xiaokang Chen, Jimin Pi, Kun Yao, Junyu Han, Errui Ding, Jingdong Wang

Abstract

Masked image modeling (MIM) learns visual representations by masking and reconstructing image patches. Applying reconstruction supervision on CLIP representations has proven effective for MIM. However, how CLIP supervision influences MIM performance remains under-explored. To investigate strategies for refining CLIP-targeted MIM, we study two critical elements of MIM, i.e., the supervision position and the mask ratio, and reveal two interesting perspectives through a simple pipeline we develop, the context autoencoder with CLIP target (CAE v2). First, we observe that supervision on visible patches achieves remarkable performance, even better than supervision on masked patches, which is the standard format in existing MIM methods. Second, the optimal mask ratio positively correlates with the model size; that is, the smaller the model, the lower the mask ratio should be. Driven by these two discoveries, our simple and concise approach CAE v2 achieves superior performance on a series of downstream tasks. For example, a vanilla ViT-Large model pre-trained for 300 epochs achieves 81.7% and 86.7% top-1 accuracy with linear probing and fine-tuning on ImageNet-1K, respectively, and 55.9% mIoU on semantic segmentation on ADE20K. We hope our findings can serve as helpful guidelines for pre-training in the MIM area, especially for small-scale models.
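To make the two findings concrete, below is a minimal, illustrative PyTorch sketch of a CLIP-targeted MIM loss, not the authors' released code. It assumes a ViT-style student operating on visible tokens, a frozen CLIP visual encoder that yields per-patch target features, and a projection head; the names `patch_embed`, `student`, `clip_teacher`, and `head` are hypothetical placeholders. The point it illustrates is the paper's first finding: the loss is computed on the visible patches rather than the masked ones.

```python
# Illustrative sketch of a CLIP-targeted MIM loss with supervision on
# visible patches (not the authors' implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F

def clip_targeted_mim_loss(patch_embed, student, clip_teacher, head,
                           images, mask_ratio=0.25):
    """Predict frozen-CLIP patch features and supervise the predictions
    on the *visible* patch positions."""
    tokens = patch_embed(images)                  # (B, N, D) patch tokens
    B, N, D = tokens.shape

    # Randomly mask a fraction of patches; keep the rest visible.
    num_masked = int(N * mask_ratio)
    ids = torch.rand(B, N, device=tokens.device).argsort(dim=1)
    visible_ids = ids[:, num_masked:]             # (B, N_vis)

    # Encode only the visible tokens.
    vis_tokens = torch.gather(
        tokens, 1, visible_ids.unsqueeze(-1).expand(-1, -1, D))
    latent = student(vis_tokens)                  # (B, N_vis, D)

    # Frozen CLIP teacher provides per-patch target features.
    with torch.no_grad():
        target = clip_teacher(images)             # (B, N, C)
    target_vis = torch.gather(
        target, 1, visible_ids.unsqueeze(-1).expand(-1, -1, target.size(-1)))

    # Cosine loss on the visible positions only.
    pred = head(latent)                           # project to CLIP dim C
    return (1 - F.cosine_similarity(pred, target_vis, dim=-1)).mean()

# Toy usage with random stand-ins (shapes only; no real CLIP weights).
B, N, D, C = 2, 196, 384, 512
patch_embed = lambda x: x                         # pretend inputs are tokens
student = nn.Identity()                           # stand-in encoder
head = nn.Linear(D, C)
clip_teacher = nn.Linear(D, C)                    # stand-in for frozen CLIP
images = torch.randn(B, N, D)
loss = clip_targeted_mim_loss(patch_embed, student, clip_teacher, head,
                              images, mask_ratio=0.25)
```

Per the paper's second finding, `mask_ratio` should be chosen to scale with model size: smaller students take a lower mask ratio, larger ones a higher one. The 0.25 default above is only a placeholder, not a value reported in the abstract.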
