Paper Title


WinoGAViL: Gamified Association Benchmark to Challenge Vision-and-Language Models

Authors

Yonatan Bitton, Nitzan Bitton Guetta, Ron Yosef, Yuval Elovici, Mohit Bansal, Gabriel Stanovsky, Roy Schwartz

Abstract


While vision-and-language models perform well on tasks such as visual question answering, they struggle when it comes to basic human commonsense reasoning skills. In this work, we introduce WinoGAViL: an online game of vision-and-language associations (e.g., between werewolves and a full moon), used as a dynamic evaluation benchmark. Inspired by the popular card game Codenames, a spymaster gives a textual cue related to several visual candidates, and another player tries to identify them. Human players are rewarded for creating associations that are challenging for a rival AI model but still solvable by other human players. We use the game to collect 3.5K instances, finding that they are intuitive for humans (>90% Jaccard index) but challenging for state-of-the-art AI models, where the best model (ViLT) achieves a score of 52%, succeeding mostly where the cue is visually salient. Our analysis as well as the feedback we collect from players indicate that the collected associations require diverse reasoning skills, including general knowledge, common sense, abstraction, and more. We release the dataset, the code and the interactive game, allowing future data collection that can be used to develop models with better association abilities.
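The human-performance figure above (>90%) is a Jaccard index between the candidates a solver selects and the gold association set. A minimal sketch of that metric is shown below; the function name, filenames, and example sets are illustrative assumptions, not taken from the released WinoGAViL code.

```python
def jaccard_index(predicted: set, gold: set) -> float:
    """Intersection over union of the selected and gold candidate sets."""
    if not predicted and not gold:
        return 1.0  # both empty: treat as perfect agreement
    return len(predicted & gold) / len(predicted | gold)

# Hypothetical example: the cue "werewolf" over several visual candidates.
# The solver picks three images, two of which are in the gold set.
gold = {"full_moon.jpg", "fangs.jpg", "forest_night.jpg"}
predicted = {"full_moon.jpg", "fangs.jpg", "dog.jpg"}
print(jaccard_index(predicted, gold))  # 2 shared / 4 total = 0.5
```

A model scoring 52% under this metric, as ViLT does here, agrees with the gold associations on roughly half the candidate set on average, while human solvers exceed 0.9.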
