Paper Title

Linearly Mapping from Image to Text Space

Paper Authors

Jack Merullo, Louis Castricato, Carsten Eickhoff, Ellie Pavlick

Paper Abstract

The extent to which text-only language models (LMs) learn to represent features of the non-linguistic world is an open question. Prior work has shown that pretrained LMs can be taught to caption images when a vision model's parameters are optimized to encode images in the language space. We test a stronger hypothesis: that the conceptual representations learned by frozen text-only models and vision-only models are similar enough that this can be achieved with a linear map. We show that the image representations from vision models can be transferred as continuous prompts to frozen LMs by training only a single linear projection. Using these to prompt the LM achieves competitive performance on captioning and visual question answering tasks compared to models that tune both the image encoder and text decoder (such as the MAGMA model). We compare three image encoders with increasing amounts of linguistic supervision seen during pretraining: BEIT (no linguistic information), NF-ResNET (lexical category information), and CLIP (full natural language descriptions). We find that all three encoders perform equally well at transferring visual property information to the language model (e.g., whether an animal is large or small), but that image encoders pretrained with linguistic supervision more saliently encode category information (e.g., distinguishing hippo vs. elephant) and thus perform significantly better on benchmark language-and-vision tasks. Our results indicate that LMs encode conceptual information structurally similarly to vision-based models, even those that are solely trained on images. Code is available here: https://github.com/jmerullo/limber
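A minimal sketch of the setup described in the abstract: the only trainable component is a single linear projection that maps frozen image-encoder features into the language model's input embedding space, where they act as continuous ("soft") prompts. This is an illustrative example under stated assumptions, not the paper's actual implementation; the module name, feature dimensions, and number of prompt tokens below are made up for clarity.

import torch
import torch.nn as nn

class LinearImageToTextPrompt(nn.Module):
    # Sketch only: dimensions and prompt length are assumed, not taken
    # from the paper. The vision encoder and the LM are both frozen;
    # this projection is the only module that receives gradients.
    def __init__(self, image_dim=1024, lm_embed_dim=4096, num_prompt_tokens=4):
        super().__init__()
        self.num_prompt_tokens = num_prompt_tokens
        self.lm_embed_dim = lm_embed_dim
        self.proj = nn.Linear(image_dim, lm_embed_dim * num_prompt_tokens)

    def forward(self, image_features):
        # image_features: (batch, image_dim) pooled outputs of a frozen
        # vision encoder such as BEIT, NF-ResNet, or CLIP.
        prompts = self.proj(image_features)
        # Reshape into a short sequence of soft-prompt vectors that get
        # prepended to the caption's token embeddings before the frozen
        # LM computes its usual next-token loss.
        return prompts.view(-1, self.num_prompt_tokens, self.lm_embed_dim)

# Example usage (shapes are illustrative):
mapper = LinearImageToTextPrompt()
feats = torch.randn(2, 1024)          # pooled features for 2 images
soft_prompts = mapper(feats)          # -> (2, 4, 4096)

During training, only the projection's weights are updated with the standard captioning loss; the image encoder and the language model stay frozen, which is what lets the authors interpret good transfer as evidence that the two representation spaces are already approximately linearly alignable.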
