Paper Title

Event and Entity Extraction from Generated Video Captions

Authors

Scherer, Johannes, Scherp, Ansgar, Bhowmik, Deepayan

Abstract

Annotation of multimedia data by humans is time-consuming and costly, while reliable automatic generation of semantic metadata is a major challenge. We propose a framework to extract semantic metadata from automatically generated video captions. As metadata, we consider entities, the entities' properties, relations between entities, and the video category. We employ two state-of-the-art dense video captioning models with masked transformer (MT) and parallel decoding (PVDC) to generate captions for videos of the ActivityNet Captions dataset. Our experiments show that it is possible to extract entities, their properties, relations between entities, and the video category from the generated captions. We observe that the quality of the extracted information is mainly influenced by the quality of the event localization in the video as well as the performance of the event caption generation.
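The pipeline the abstract describes — generate a caption for a video event, then mine it for entities, properties, and relations — can be illustrated with a toy sketch. This is hypothetical code, not the authors' implementation: a real system would use an NLP pipeline (POS tagging, dependency parsing) instead of the tiny hand-written lexicon used here, and the `LEXICON` entries and `extract_metadata` function are invented for illustration.

```python
# Toy stand-in for a POS tagger / dependency parser: a hand-labeled
# lexicon mapping caption words to metadata roles.
LEXICON = {
    "man": "ENTITY", "bicycle": "ENTITY", "dog": "ENTITY",
    "red": "PROPERTY", "small": "PROPERTY",
    "riding": "RELATION", "chasing": "RELATION",
}

def extract_metadata(caption: str) -> dict:
    """Extract entities, properties, and (subject, relation, object)
    triples from a generated caption, as sketched in the abstract."""
    tokens = [t.strip(".,").lower() for t in caption.split()]
    entities = [t for t in tokens if LEXICON.get(t) == "ENTITY"]
    properties = [t for t in tokens if LEXICON.get(t) == "PROPERTY"]
    relations = [t for t in tokens if LEXICON.get(t) == "RELATION"]
    # Naive relation heuristic: first two entities linked by the first verb.
    triples = []
    if len(entities) >= 2 and relations:
        triples.append((entities[0], relations[0], entities[1]))
    return {"entities": entities, "properties": properties, "triples": triples}

print(extract_metadata("A man is riding a red bicycle."))
# → {'entities': ['man', 'bicycle'], 'properties': ['red'],
#    'triples': [('man', 'riding', 'bicycle')]}
```

As the abstract notes, the quality of such extracted metadata is bounded by the captioning stage: if event localization or caption generation fails, no downstream extraction rule can recover the missing entities or relations.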
