Title

Enriching Video Captions With Contextual Text

Authors

Rimle, Philipp; Dogan, Pelin; Gross, Markus

Abstract

Understanding video content and generating captions with context is an important and challenging task. Unlike prior methods that typically attempt to generate generic video captions without context, our architecture contextualizes captioning by infusing information extracted from relevant text data. We propose an end-to-end sequence-to-sequence model which generates video captions based on visual input and mines relevant knowledge, such as names and locations, from contextual text. In contrast to previous approaches, we do not preprocess the text further and instead let the model learn to attend over it directly. Guided by the visual input, the model is able to copy words from the contextual text via a pointer-generator network, allowing it to produce more specific video captions. We show competitive performance on the News Video Dataset and, through ablation studies, validate the efficacy of contextual video captioning as well as individual design choices in our model architecture.
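
The copy mechanism the abstract refers to follows the general pointer-generator formulation (See et al., 2017): at each decoding step, the model mixes a generation distribution over a fixed vocabulary with a copy distribution induced by attention over the contextual-text tokens. The sketch below illustrates only that mixing step in PyTorch; the function name, tensor shapes, and the way p_gen is produced are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn.functional as F

def pointer_generator_step(vocab_logits, attn_weights, src_token_ids, p_gen):
    """Mix generation and copy distributions for one decoding step.

    vocab_logits:  (batch, vocab_size) decoder logits over the fixed vocabulary
    attn_weights:  (batch, src_len)    attention over contextual-text tokens (rows sum to 1)
    src_token_ids: (batch, src_len)    vocabulary ids of the contextual-text tokens
    p_gen:         (batch, 1)          probability of generating vs. copying
    """
    # Distribution for generating a word from the fixed vocabulary.
    p_vocab = F.softmax(vocab_logits, dim=-1)
    # Distribution for copying a word from the contextual text: scatter the
    # attention mass of each source position onto that token's vocabulary id.
    copy_dist = torch.zeros_like(p_vocab)
    copy_dist.scatter_add_(1, src_token_ids, attn_weights)
    # Final word distribution: a convex mixture of generating and copying.
    return p_gen * p_vocab + (1.0 - p_gen) * copy_dist

# Toy usage: batch of 1, vocabulary of 10 words, 4 contextual-text tokens.
logits = torch.randn(1, 10)
attn = F.softmax(torch.randn(1, 4), dim=-1)
src_ids = torch.tensor([[2, 5, 5, 7]])    # contextual tokens' vocab ids (hypothetical)
p_gen = torch.sigmoid(torch.randn(1, 1))  # in practice computed from the decoder state
p_final = pointer_generator_step(logits, attn, src_ids, p_gen)
assert torch.allclose(p_final.sum(dim=-1), torch.ones(1))
```

Note that this sketch assumes every contextual-text token has an id in the fixed vocabulary; a full pointer-generator typically extends the output vocabulary with per-example out-of-vocabulary tokens so that names and locations absent from the vocabulary can still be copied.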
