Paper Title
Learning Explicit Object-Centric Representations with Vision Transformers
Paper Authors
Paper Abstract
With the recent successful adaptation of transformers to the vision domain, particularly when trained in a self-supervised fashion, it has been shown that vision transformers can learn impressive object-reasoning-like behaviour and features that are expressive for the task of object segmentation in images. In this paper, we build on the self-supervision task of masked autoencoding and explore its effectiveness for explicitly learning object-centric representations with transformers. To this end, we design an object-centric autoencoder using only transformers and train it end-to-end to reconstruct full images from unmasked patches. We show that the model efficiently learns to decompose simple scenes as measured by segmentation metrics on several multi-object benchmarks.
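As a rough illustration of the masked-autoencoding input pipeline the abstract refers to (splitting an image into patches and keeping only a random unmasked subset for the encoder), here is a minimal NumPy sketch. The function names, the 75% mask ratio, and the image/patch sizes are illustrative assumptions, not taken from the paper; the transformer encoder/decoder themselves are omitted.

```python
import numpy as np

def patchify(image, patch_size):
    """Split an HxWxC image into non-overlapping patch_size x patch_size
    patches, each flattened to a vector — the token sequence a vision
    transformer operates on."""
    h, w, c = image.shape
    ph, pw = h // patch_size, w // patch_size
    patches = image.reshape(ph, patch_size, pw, patch_size, c)
    # Reorder so each patch's pixels are contiguous, then flatten per patch.
    return patches.transpose(0, 2, 1, 3, 4).reshape(ph * pw, -1)

def random_mask(patches, mask_ratio, rng):
    """Keep a random subset of patches (the 'unmasked' ones fed to the
    encoder). Returns the visible patches and a boolean mask over all
    patch positions (True = masked out, to be reconstructed)."""
    n = patches.shape[0]
    n_keep = int(n * (1 - mask_ratio))
    keep_idx = np.sort(rng.permutation(n)[:n_keep])
    mask = np.ones(n, dtype=bool)
    mask[keep_idx] = False
    return patches[keep_idx], mask

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))                 # toy 32x32 RGB image
patches = patchify(image, patch_size=8)         # 16 patches of dim 8*8*3 = 192
visible, mask = random_mask(patches, mask_ratio=0.75, rng=rng)
print(patches.shape, visible.shape, int(mask.sum()))  # (16, 192) (4, 192) 12
```

With a 75% mask ratio, only 4 of the 16 patch tokens would be encoded; the decoder would then be trained to reconstruct the full image, including the 12 masked positions.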