Paper Title
MixMAE: Mixed and Masked Autoencoder for Efficient Pretraining of Hierarchical Vision Transformers
Paper Authors
Paper Abstract
In this paper, we propose Mixed and Masked AutoEncoder (MixMAE), a simple but efficient pretraining method that is applicable to various hierarchical Vision Transformers. Existing masked image modeling (MIM) methods for hierarchical Vision Transformers replace a random subset of input tokens with a special [MASK] symbol and aim to reconstruct the original image tokens from the corrupted image. However, we find that using the [MASK] symbol greatly slows down training and causes pretraining-finetuning inconsistency, due to the large masking ratio (e.g., 60% in SimMIM). On the other hand, MAE does not introduce [MASK] tokens at its encoder at all, but it is not applicable to hierarchical Vision Transformers. To solve the issue and accelerate the pretraining of hierarchical models, we replace the masked tokens of one image with visible tokens of another image, i.e., creating a mixed image. We then conduct dual reconstruction to reconstruct the two original images from the mixed input, which significantly improves efficiency. While MixMAE can be applied to various hierarchical Transformers, this paper explores using Swin Transformer with a large window size and scales up to a huge model size (up to 600M parameters). Empirical results demonstrate that MixMAE can learn high-quality visual representations efficiently. Notably, MixMAE with Swin-B/W14 achieves 85.1% top-1 accuracy on ImageNet-1K by pretraining for 600 epochs. In addition, its transfer performance on 6 other datasets shows that MixMAE has a better FLOPs/performance tradeoff than previous popular MIM methods. Code is available at https://github.com/Sense-X/MixMIM.
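The core mixing and dual-reconstruction idea described in the abstract can be illustrated with a short PyTorch-style sketch. The function names, tensor shapes, masking scheme, and loss weighting below are illustrative assumptions for clarity, not the authors' released implementation (see the linked repository for that).

```python
import torch

def mix_tokens(tokens_a, tokens_b, mask_ratio=0.5):
    """Create a mixed token sequence from two images' patch tokens.

    tokens_a, tokens_b: (B, N, D) patch embeddings of two images.
    Returns the mixed tokens and a binary mask (1 = position kept from image A).
    Illustrative sketch only, not the official MixMAE code.
    """
    B, N, _ = tokens_a.shape
    num_keep = int(N * (1 - mask_ratio))
    # Randomly choose which positions keep image A's tokens;
    # the remaining positions are filled with image B's visible tokens
    # instead of a [MASK] symbol.
    noise = torch.rand(B, N, device=tokens_a.device)
    ids = noise.argsort(dim=1)
    mask = torch.zeros(B, N, device=tokens_a.device)
    mask.scatter_(1, ids[:, :num_keep], 1.0)
    mask = mask.unsqueeze(-1)  # (B, N, 1)
    mixed = mask * tokens_a + (1 - mask) * tokens_b
    return mixed, mask

def dual_reconstruction_loss(pred_a, pred_b, target_a, target_b, mask):
    """Reconstruct both originals from the mixed input.

    Each image's loss is computed only on the positions where its own
    tokens were replaced by the other image's tokens.
    """
    mask2d = mask.squeeze(-1)  # (B, N)
    err_a = ((pred_a - target_a) ** 2).mean(dim=-1)  # per-token error for image A
    err_b = ((pred_b - target_b) ** 2).mean(dim=-1)  # per-token error for image B
    loss_a = (err_a * (1 - mask2d)).sum() / (1 - mask2d).sum()
    loss_b = (err_b * mask2d).sum() / mask2d.sum()
    return loss_a + loss_b
```

In the full pipeline, `mixed` would be passed through the hierarchical encoder and a decoder to produce `pred_a` and `pred_b`; those components are omitted here for brevity.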