Paper Title
Motion and Context-Aware Audio-Visual Conditioned Video Prediction
Paper Authors
Paper Abstract
The existing state-of-the-art method for audio-visual conditioned video prediction uses the latent codes of the audio-visual frames from a multimodal stochastic network and a frame encoder to predict the next visual frame. However, directly inferring per-pixel intensities for the next visual frame is extremely challenging because of the high-dimensional image space. To this end, we decouple audio-visual conditioned video prediction into motion and appearance modeling. Multimodal motion estimation predicts future optical flow based on the audio-motion correlation. The visual branch recalls from a motion memory built from the audio features to enable better long-term prediction. We further propose context-aware refinement to address the diminishing global appearance context under long-term continuous warping. The global appearance context is extracted by a context encoder and manipulated by a motion-conditioned affine transformation before being fused with the features of the warped frames. Experimental results show that our method achieves competitive results on existing benchmarks.
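To make the two operations named in the abstract more concrete, the following is a minimal PyTorch sketch of (a) backward-warping a frame with a predicted optical flow field and (b) a motion-conditioned affine transformation applied to global context features. The function and module names (`warp`, `MotionConditionedAffine`), the FiLM-style scale-and-shift formulation, and all tensor shapes are assumptions made for illustration; they are not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def warp(frame, flow):
    """Backward-warp a frame with a dense optical flow field.

    frame: (B, C, H, W); flow: (B, 2, H, W) given as (dx, dy) pixel offsets.
    """
    b, _, h, w = frame.shape
    # Base sampling grid in pixel coordinates.
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().to(frame.device)   # (2, H, W)
    coords = grid.unsqueeze(0) + flow                               # (B, 2, H, W)
    # Normalize coordinates to [-1, 1] as expected by grid_sample.
    coords_x = 2.0 * coords[:, 0] / (w - 1) - 1.0
    coords_y = 2.0 * coords[:, 1] / (h - 1) - 1.0
    grid_norm = torch.stack((coords_x, coords_y), dim=-1)           # (B, H, W, 2)
    return F.grid_sample(frame, grid_norm, align_corners=True)


class MotionConditionedAffine(nn.Module):
    """Hypothetical FiLM-style modulation: predict per-channel scale and shift
    from a motion code and apply them to the global context features."""

    def __init__(self, context_channels, motion_dim):
        super().__init__()
        self.to_scale_shift = nn.Linear(motion_dim, 2 * context_channels)

    def forward(self, context_feat, motion_code):
        # context_feat: (B, C, H, W); motion_code: (B, motion_dim)
        scale, shift = self.to_scale_shift(motion_code).chunk(2, dim=1)
        scale = scale.unsqueeze(-1).unsqueeze(-1)
        shift = shift.unsqueeze(-1).unsqueeze(-1)
        return context_feat * (1.0 + scale) + shift


# Example usage with made-up shapes.
frames = torch.randn(2, 3, 64, 64)
flow = torch.randn(2, 2, 64, 64)
context = torch.randn(2, 32, 16, 16)
motion_code = torch.randn(2, 128)

warped = warp(frames, flow)                              # (2, 3, 64, 64)
refined = MotionConditionedAffine(32, 128)(context, motion_code)
```

In this sketch the modulated context features would then be fused (e.g., concatenated or added) with the encoder features of the warped frames; the exact fusion strategy is left unspecified here, since the abstract does not detail it.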