Paper Title
How Would The Viewer Feel? Estimating Wellbeing From Video Scenarios
Paper Authors
Paper Abstract
In recent years, deep neural networks have demonstrated increasingly strong abilities to recognize objects and activities in videos. However, as video understanding becomes widely used in real-world applications, a key consideration is developing human-centric systems that understand not only the content of the video but also how it would affect the wellbeing and emotional state of viewers. To facilitate research in this setting, we introduce two large-scale datasets with over 60,000 videos manually annotated for emotional response and subjective wellbeing. The Video Cognitive Empathy (VCE) dataset contains annotations for distributions of fine-grained emotional responses, allowing models to gain a detailed understanding of affective states. The Video to Valence (V2V) dataset contains annotations of relative pleasantness between videos, which enables predicting a continuous spectrum of wellbeing. In experiments, we show how video models that are primarily trained to recognize actions and find contours of objects can be repurposed to understand human preferences and the emotional content of videos. Although there is room for improvement, predicting wellbeing and emotional response is on the horizon for state-of-the-art models. We hope our datasets can help foster further advances at the intersection of commonsense video understanding and human preference learning.
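To make the two annotation formats concrete, below is a minimal PyTorch-style sketch of how a pretrained video backbone's embeddings could be repurposed for both tasks. The names and numbers here (e.g., `EmotionDistributionHead`, `FEATURE_DIM`, `NUM_EMOTIONS`), the KL-divergence loss for VCE's emotion-distribution labels, and the Bradley-Terry-style pairwise loss for V2V's relative-pleasantness comparisons are illustrative assumptions, not the paper's published training setup.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative constants -- assumptions, not values from the paper.
FEATURE_DIM = 768      # dimensionality of a pretrained video backbone's embedding
NUM_EMOTIONS = 27      # hypothetical number of fine-grained emotion categories


class EmotionDistributionHead(nn.Module):
    """Predicts a distribution over emotion categories from a video embedding (VCE-style task)."""

    def __init__(self, feature_dim: int = FEATURE_DIM, num_emotions: int = NUM_EMOTIONS):
        super().__init__()
        self.fc = nn.Linear(feature_dim, num_emotions)

    def forward(self, video_features: torch.Tensor) -> torch.Tensor:
        # Return log-probabilities over emotion categories.
        return F.log_softmax(self.fc(video_features), dim=-1)


class WellbeingHead(nn.Module):
    """Scores a video with a scalar pleasantness value (V2V-style task)."""

    def __init__(self, feature_dim: int = FEATURE_DIM):
        super().__init__()
        self.fc = nn.Linear(feature_dim, 1)

    def forward(self, video_features: torch.Tensor) -> torch.Tensor:
        return self.fc(video_features).squeeze(-1)


def vce_loss(log_probs: torch.Tensor, target_dist: torch.Tensor) -> torch.Tensor:
    # KL divergence between the annotated emotion distribution and the model's prediction.
    return F.kl_div(log_probs, target_dist, reduction="batchmean")


def v2v_loss(score_preferred: torch.Tensor, score_other: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry / logistic pairwise loss: the video annotated as more pleasant
    # should receive a higher scalar score than the other video in the pair.
    return -F.logsigmoid(score_preferred - score_other).mean()


if __name__ == "__main__":
    # Toy batch of precomputed video embeddings, standing in for a real backbone's output.
    feats_a = torch.randn(4, FEATURE_DIM)
    feats_b = torch.randn(4, FEATURE_DIM)
    target = torch.softmax(torch.randn(4, NUM_EMOTIONS), dim=-1)  # annotated emotion distribution

    emo_head, wb_head = EmotionDistributionHead(), WellbeingHead()
    print("VCE loss:", vce_loss(emo_head(feats_a), target).item())
    print("V2V loss:", v2v_loss(wb_head(feats_a), wb_head(feats_b)).item())
```

With a scalar wellbeing head trained on pairwise comparisons like this, the learned scores induce the continuous spectrum of pleasantness that the V2V annotations are meant to support, while the distribution head captures the fine-grained emotional responses annotated in VCE.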