Paper Title
Taking an Emotional Look at Video Paragraph Captioning
Paper Authors
Paper Abstract
Translating visual data into natural language is essential for machines to understand the world and interact with humans. In this work, a comprehensive study is conducted on video paragraph captioning, with the goal of generating paragraph-level descriptions for a given video. However, current research mainly focuses on detecting objective facts, ignoring the need to establish logical associations between sentences and to discover more accurate emotions related to video content. This problem impairs the fluency and richness of predicted captions, which fall far below human language standards. To solve this problem, we propose to construct a large-scale emotion- and logic-driven multilingual dataset for this task. The dataset, named EMVPC (standing for "Emotional Video Paragraph Captioning"), contains 53 emotions widely expressed in daily life, 376 common scenes corresponding to these emotions, 10,291 high-quality videos, and 20,582 elaborated paragraph captions with English and Chinese versions. Relevant emotion categories, scene labels, emotion word labels, and logic word labels are also provided in this new dataset. The proposed EMVPC dataset is intended to support full-fledged video paragraph captioning in terms of rich emotions, coherent logic, and elaborate expressions, and can also benefit other tasks in vision-language fields. Furthermore, a comprehensive study is conducted through experiments on existing benchmark video paragraph captioning datasets and the proposed EMVPC. State-of-the-art schemes from different visual captioning tasks are compared in terms of 15 popular metrics, and their detailed objective and subjective results are summarized. Finally, remaining problems and future directions of video paragraph captioning are also discussed. The unique perspective of this work is expected to boost further development in video paragraph captioning research.
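The abstract lists the annotation fields attached to each EMVPC video (emotion category, scene label, emotion word labels, logic word labels, and bilingual paragraph captions). As a minimal sketch of how one such record might be represented, assuming the field names below (which are illustrative, not the dataset's actual schema):

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical record for one EMVPC sample, based only on the annotation
# types described in the abstract; the real dataset schema may differ.
@dataclass
class EMVPCSample:
    video_id: str                  # identifier of one of the 10,291 videos
    emotion_category: str          # one of the 53 emotion categories
    scene_label: str               # one of the 376 common scenes
    caption_en: str                # English paragraph caption
    caption_zh: str                # Chinese paragraph caption
    emotion_words: List[str] = field(default_factory=list)  # emotion word labels
    logic_words: List[str] = field(default_factory=list)    # logic word labels

# Illustrative sample (values are invented for demonstration).
sample = EMVPCSample(
    video_id="vid_00001",
    emotion_category="joy",
    scene_label="birthday party",
    caption_en=("A child happily blows out the candles. "
                "Then the whole family cheers and claps."),
    caption_zh="孩子开心地吹灭蜡烛。然后全家人欢呼鼓掌。",
    emotion_words=["happily", "cheers"],
    logic_words=["then"],
)
print(sample.emotion_category, sample.scene_label)
```

The two paragraph captions per video (English and Chinese) match the 2:1 ratio of the 20,582 captions to the 10,291 videos stated in the abstract.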