Paper Title
CLIP-Event: Connecting Text and Images with Event Structures
Paper Authors
Paper Abstract
Vision-language (V+L) pretraining models have achieved great success in supporting multimedia applications by understanding the alignments between images and text. While existing vision-language pretraining models primarily focus on understanding objects in images or entities in text, they often ignore the alignment at the level of events and their argument structures. In this work, we propose a contrastive learning framework to enforce vision-language pretraining models to comprehend events and associated argument (participant) roles. To achieve this, we take advantage of text information extraction technologies to obtain event structural knowledge, and utilize multiple prompt functions to contrast difficult negative descriptions by manipulating event structures. We also design an event graph alignment loss based on optimal transport to capture event argument structures. In addition, we collect a large event-rich dataset (106,875 images) for pretraining, which provides a more challenging image retrieval benchmark to assess the understanding of complicated lengthy sentences. Experiments show that our zero-shot CLIP-Event outperforms the state-of-the-art supervised model in argument extraction on Multimedia Event Extraction, achieving more than 5% absolute F-score gain in event extraction, as well as significant improvements on a variety of downstream tasks under zero-shot settings.
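To make the two mechanisms named in the abstract concrete, here is a minimal sketch in PyTorch. It is not the authors' implementation: the role-swap negative strategy, the uniform Sinkhorn marginals, and the names `make_negative_caption`, `sinkhorn`, and `event_graph_alignment_loss` are all illustrative assumptions rather than the paper's exact prompt functions or loss.

```python
# Illustrative sketch only (assumed PyTorch); not the CLIP-Event codebase.
import torch
import torch.nn.functional as F

def make_negative_caption(event_type: str, roles: dict) -> str:
    """Build a hard negative description by manipulating the event
    structure: swap two argument fillers so the sentence stays fluent
    but the role assignment is wrong (a hypothetical prompt function)."""
    keys = list(roles)
    if len(keys) >= 2:
        roles = dict(roles)
        roles[keys[0]], roles[keys[1]] = roles[keys[1]], roles[keys[0]]
    args = ", ".join(f"{r}: {v}" for r, v in roles.items())
    return f"An image of a {event_type} event ({args})."

def sinkhorn(cost: torch.Tensor, n_iters: int = 50, eps: float = 0.1):
    """Entropy-regularized optimal transport via Sinkhorn iterations,
    assuming uniform marginals; returns the transport plan T."""
    n, m = cost.shape
    K = torch.exp(-cost / eps)                 # Gibbs kernel
    a = cost.new_full((n,), 1.0 / n)           # uniform source marginal
    b = cost.new_full((m,), 1.0 / m)           # uniform target marginal
    u, v = a.clone(), b.clone()
    for _ in range(n_iters):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]         # plan: diag(u) K diag(v)

def event_graph_alignment_loss(text_nodes: torch.Tensor,
                               image_nodes: torch.Tensor) -> torch.Tensor:
    """OT-based alignment between text event-graph node embeddings
    (trigger + arguments) and image node embeddings (image + objects):
    cost = 1 - cosine similarity, loss = <T, cost>."""
    t = F.normalize(text_nodes, dim=-1)
    i = F.normalize(image_nodes, dim=-1)
    cost = 1.0 - t @ i.T
    T = sinkhorn(cost)
    return (T * cost).sum()
```

In a setup like this, the structure-manipulated negatives would feed a standard contrastive (InfoNCE-style) objective alongside the positive caption, while the transport plan `T` provides a soft, interpretable alignment between textual argument nodes and image regions.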