Paper Title


AutoFoley: Artificial Synthesis of Synchronized Sound Tracks for Silent Videos with Deep Learning

Authors

Sanchita Ghose, John J. Prevost

Abstract


In movie productions, the Foley Artist is responsible for creating an overlay soundtrack that helps the movie come alive for the audience. This requires the artist to first identify the sounds that will enhance the experience for the listener, thereby reinforcing the Director's intention for a given scene. In this paper, we present AutoFoley, a fully automated deep learning tool that can be used to synthesize a representative audio track for videos. AutoFoley can be used in applications where there is either no corresponding audio file associated with the video, or in cases where there is a need to identify critical scenarios and provide a synthesized, reinforced soundtrack. An important performance criterion of the synthesized soundtrack is that it be time-synchronized with the input video, which provides for a realistic and believable portrayal of the synthesized sound. Unlike existing sound prediction and generation architectures, our algorithm is capable of precise recognition of actions as well as inter-frame relations in fast-moving video clips by incorporating an interpolation technique and Temporal Relationship Networks (TRN). We employ a robust multi-scale Recurrent Neural Network (RNN) associated with a Convolutional Neural Network (CNN) for a better understanding of the intricate input-to-output associations over time. To evaluate AutoFoley, we create and introduce a large-scale audio-video dataset containing a variety of sounds frequently used as Foley effects in movies. Our experiments show that the synthesized sounds are realistically portrayed with accurate temporal synchronization to the associated visual inputs. Human qualitative testing of AutoFoley shows that over 73% of the test subjects considered the generated soundtrack as original, which is a noteworthy improvement in cross-modal research in sound synthesis.
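The abstract does not specify the interpolation technique used to densify fast-moving clips before action recognition; as a hypothetical illustration of the idea, the following minimal NumPy sketch linearly interpolates between consecutive video frames to increase the effective frame rate (the function name and parameters are assumptions, not the paper's API):

```python
import numpy as np

def interpolate_frames(frames, factor=2):
    """Linearly interpolate between consecutive frames of a clip.

    frames: array of shape (T, H, W, C).
    Returns (factor * (T - 1) + 1) frames, inserting `factor - 1`
    blended frames between each original pair.
    """
    frames = np.asarray(frames, dtype=np.float32)
    out = []
    for a, b in zip(frames[:-1], frames[1:]):
        for k in range(factor):
            t = k / factor  # blend weight between frame a and frame b
            out.append((1.0 - t) * a + t * b)
    out.append(frames[-1])  # keep the final original frame
    return np.stack(out)
```

A 3-frame clip densified with `factor=2` yields 5 frames, with each inserted frame the pixel-wise average of its neighbors; a denser sequence gives the downstream TRN and CNN/RNN stages finer temporal resolution for rapid motion.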
