Paper Title
Seeing past words: Testing the cross-modal capabilities of pretrained V&L models on counting tasks

Paper Authors

Letitia Parcalabescu, Albert Gatt, Anette Frank, Iacer Calixto

Paper Abstract
We investigate the reasoning ability of pretrained vision and language (V&L) models in two tasks that require multimodal integration: (1) discriminating a correct image-sentence pair from an incorrect one, and (2) counting entities in an image. We evaluate three pretrained V&L models on these tasks: ViLBERT, ViLBERT 12-in-1 and LXMERT, in zero-shot and finetuned settings. Our results show that models solve task (1) very well, as expected, since all models are pretrained on task (1). However, none of the pretrained V&L models is able to adequately solve task (2), our counting probe, and they cannot generalise to out-of-distribution quantities. We propose a number of explanations for these findings: LXMERT (and to some extent ViLBERT 12-in-1) show some evidence of catastrophic forgetting on task (1). Concerning our results on the counting probe, we find evidence that all models are impacted by dataset bias, and also fail to individuate entities in the visual input. While a selling point of pretrained V&L models is their ability to solve complex tasks, our findings suggest that understanding their reasoning and grounding capabilities requires more targeted investigations on specific phenomena.