Paper Title
Masked autoencoders are effective solution to transformer data-hungry
Paper Authors
Paper Abstract
Vision Transformers (ViTs) outperform convolutional neural networks (CNNs) on several vision tasks thanks to their global modeling capability. However, ViT lacks the inductive biases inherent to convolution and therefore requires a large amount of training data. As a result, ViT does not perform as well as CNNs on small datasets, such as those common in medicine and science. We experimentally found that masked autoencoders (MAE) can make the transformer focus more on the image itself, alleviating ViT's data-hungry problem to some extent. Yet the current MAE model is too complex, which leads to overfitting on small datasets and leaves a gap between MAE trained on small datasets and advanced CNN models. We therefore investigated how to reduce the decoder complexity in MAE and found an architectural configuration better suited to small datasets. In addition, we designed a location prediction task and a contrastive learning task to introduce localization and invariance characteristics into MAE. Our contrastive learning task not only enables the model to learn high-level visual information but also trains MAE's class token, something that most MAE improvement efforts do not consider. Extensive experiments show that, compared with currently popular masked image modeling (MIM) methods and vision transformers for small datasets, our method achieves state-of-the-art performance on standard small datasets as well as on medical datasets with few samples. The code and models are available at https://github.com/Talented-Q/SDMAE.
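The abstract describes the contrastive objective only at a high level. Below is a minimal, hypothetical PyTorch sketch of how an InfoNCE-style contrastive loss on the encoder's class token could be combined with MAE's masked-patch reconstruction loss; the function names, tensor shapes, temperature, and loss weight `lam` are illustrative assumptions, not the configuration used in SDMAE.

```python
# Hypothetical sketch: MAE-style reconstruction loss on masked patches plus a
# contrastive (InfoNCE) loss on the class token of two augmented views.
# All names and hyperparameters are assumptions for illustration only.
import torch
import torch.nn.functional as F


def info_nce_on_class_tokens(cls_a, cls_b, temperature=0.2):
    """Contrastive loss between class tokens of two augmented views.

    cls_a, cls_b: (batch, dim) class-token embeddings from the encoder.
    The same image under two augmentations forms a positive pair; all other
    images in the batch serve as negatives.
    """
    a = F.normalize(cls_a, dim=-1)
    b = F.normalize(cls_b, dim=-1)
    logits = a @ b.t() / temperature               # (batch, batch) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    return F.cross_entropy(logits, targets)


def total_loss(pred_patches, target_patches, mask, cls_a, cls_b, lam=0.5):
    """Reconstruction loss over masked patches plus contrastive loss on class tokens.

    pred_patches, target_patches: (batch, num_patches, patch_dim)
    mask: (batch, num_patches), 1 where a patch was masked (as in MAE).
    """
    rec = ((pred_patches - target_patches) ** 2).mean(dim=-1)  # per-patch MSE
    rec = (rec * mask).sum() / mask.sum()                      # average over masked patches only
    con = info_nce_on_class_tokens(cls_a, cls_b)
    return rec + lam * con
```

Computing the contrastive term on the class token, rather than on patch tokens alone, is what lets the class token receive gradient during pre-training, which is the property the abstract highlights.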