Paper Title

Self-Contained Entity Discovery from Captioned Videos

Paper Authors

Melika Ayoughi, Pascal Mettes, Paul Groth

Paper Abstract

This paper introduces the task of visual named entity discovery in videos without the need for task-specific supervision or task-specific external knowledge sources. Assigning specific names to entities (e.g. faces, scenes, or objects) in video frames is a long-standing challenge. Commonly, this problem is addressed as a supervised learning objective by manually annotating faces with entity labels. To bypass the annotation burden of this setup, several works have investigated the problem by utilizing external knowledge sources such as movie databases. While effective, such approaches do not work when task-specific knowledge sources are not provided, and can only be applied to movies and TV series. In this work, we take the problem a step further and propose to discover entities in videos from the videos and their corresponding captions or subtitles alone. We introduce a three-stage method where we (i) create bipartite entity-name graphs from frame-caption pairs, (ii) find visual entity agreements, and (iii) refine the entity assignment through entity-level prototype construction. To tackle this new problem, we outline two new benchmarks, SC-Friends and SC-BBT, based on the Friends and Big Bang Theory TV series. Experiments on the benchmarks demonstrate the ability of our approach to discover which named entity belongs to which face or scene, with an accuracy close to a supervised oracle, just from the multimodal information present in videos. Additionally, our qualitative examples show the potential challenges of self-contained discovery of any visual entity for future work. The code and the data are available on GitHub.
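The three-stage pipeline in the abstract can be sketched in miniature. The snippet below is a hedged illustration, not the authors' implementation: it assumes each frame yields one face embedding plus the named entities from its caption, uses a toy k-means for the visual-agreement stage, and stands in prototype construction with simple mean embeddings. The function name `discover_entities` and all data shapes are assumptions for illustration.

```python
# Hedged sketch of the three stages described in the abstract:
# (i) bipartite entity-name graph, (ii) visual entity agreement,
# (iii) entity-level prototype refinement. All names/shapes are assumed.
from collections import Counter, defaultdict
import numpy as np

def discover_entities(frames, n_clusters=2, n_iters=5, seed=0):
    """frames: list of (embedding, candidate_names) pairs, where
    `embedding` is a 1-D feature vector for one detected face and
    `candidate_names` are the named entities in the co-occurring caption."""
    X = np.stack([emb for emb, _ in frames])

    # (i) Bipartite entity-name graph: one edge face_i -- name for every
    # name mentioned in the caption paired with that frame.
    edges = [(i, name) for i, (_, names) in enumerate(frames) for name in names]

    # (ii) Visual entity agreement: group faces by visual similarity
    # (toy k-means here), then let each visual cluster vote over the
    # names it is connected to in the bipartite graph.
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), n_clusters, replace=False)]
    for _ in range(n_iters):
        assign = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for k in range(n_clusters):
            if (assign == k).any():
                centers[k] = X[assign == k].mean(axis=0)

    votes = defaultdict(Counter)
    for i, name in edges:
        votes[assign[i]][name] += 1
    cluster_name = {k: votes[k].most_common(1)[0][0] for k in votes}

    # (iii) Prototype refinement: build one prototype embedding per
    # discovered name, then reassign every face to its nearest prototype.
    protos = defaultdict(list)
    for k, name in cluster_name.items():
        protos[name].append(centers[k])
    names = list(protos)
    P = np.stack([np.mean(protos[n], axis=0) for n in names])
    final = np.argmin(((X[:, None] - P[None]) ** 2).sum(-1), axis=1)
    return [names[j] for j in final]
```

On a toy input of four face embeddings forming two well-separated groups, with captions mentioning "Ross" and "Monica" (sometimes both, mimicking the ambiguity of multi-name subtitles), majority voting over the graph edges resolves each visual group to a single name, and the prototype step assigns every face accordingly.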
