Paper Title

INR-V: A Continuous Representation Space for Video-based Generative Tasks

Paper Authors

Bipasha Sen, Aditya Agarwal, Vinay P. Namboodiri, C. V. Jawahar

Paper Abstract

Generating videos is a complex task that is accomplished by generating a set of temporally coherent images frame-by-frame. This limits the expressivity of videos to only image-based operations on the individual video frames needing network designs to obtain temporally coherent trajectories in the underlying image space. We propose INR-V, a video representation network that learns a continuous space for video-based generative tasks. INR-V parameterizes videos using implicit neural representations (INRs), a multi-layered perceptron that predicts an RGB value for each input pixel location of the video. The INR is predicted using a meta-network which is a hypernetwork trained on neural representations of multiple video instances. Later, the meta-network can be sampled to generate diverse novel videos enabling many downstream video-based generative tasks. Interestingly, we find that conditional regularization and progressive weight initialization play a crucial role in obtaining INR-V. The representation space learned by INR-V is more expressive than an image space showcasing many interesting properties not possible with the existing works. For instance, INR-V can smoothly interpolate intermediate videos between known video instances (such as intermediate identities, expressions, and poses in face videos). It can also in-paint missing portions in videos to recover temporally coherent full videos. In this work, we evaluate the space learned by INR-V on diverse generative tasks such as video interpolation, novel video generation, video inversion, and video inpainting against the existing baselines. INR-V significantly outperforms the baselines on several of these demonstrated tasks, clearly showcasing the potential of the proposed representation space.
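
The abstract describes two components: a per-video implicit neural representation (an MLP that maps a pixel location (x, y, t) to an RGB value) and a meta-network, a hypernetwork that predicts the INR's weights from a latent code. The PyTorch sketch below illustrates that pipeline under stated assumptions; the layer sizes, sinusoidal activation, and 128-dimensional latent are illustrative choices, not the paper's exact configuration.

```python
# Minimal sketch of the INR + hypernetwork idea from the abstract.
# All architectural details here are assumptions for illustration.
import torch
import torch.nn as nn


class VideoINR(nn.Module):
    """MLP mapping a pixel location (x, y, t) to an RGB value."""

    def __init__(self, hidden=64, layers=3):
        super().__init__()
        dims = [3] + [hidden] * layers + [3]  # input (x, y, t) -> output RGB
        self.shapes = [(dims[i + 1], dims[i]) for i in range(len(dims) - 1)]

    def num_params(self):
        # Total weights + biases the hypernetwork must predict.
        return sum(o * i + o for o, i in self.shapes)

    def forward(self, coords, flat_params):
        # Run the MLP functionally, using weights produced by the hypernetwork.
        h, offset = coords, 0
        for idx, (o, i) in enumerate(self.shapes):
            w = flat_params[offset:offset + o * i].view(o, i)
            b = flat_params[offset + o * i:offset + o * i + o]
            offset += o * i + o
            h = torch.nn.functional.linear(h, w, b)
            if idx < len(self.shapes) - 1:
                h = torch.sin(h)  # sinusoidal activation, a common INR choice
        return torch.sigmoid(h)  # RGB in [0, 1]


class MetaNetwork(nn.Module):
    """Hypernetwork: maps a per-video latent code to the INR's parameters."""

    def __init__(self, inr, latent_dim=128, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, inr.num_params()),
        )

    def forward(self, z):
        return self.net(z)


# Usage: sample a latent, predict an INR, decode a tiny 4-frame 8x8 video.
inr = VideoINR()
meta = MetaNetwork(inr)
z = torch.randn(128)                     # one point in the video space
params = meta(z)                         # weights of this video's INR
t, y, x = torch.meshgrid(
    torch.linspace(0, 1, 4), torch.linspace(0, 1, 8), torch.linspace(0, 1, 8),
    indexing="ij",
)
coords = torch.stack([x, y, t], dim=-1).reshape(-1, 3)
video = inr(coords, params).reshape(4, 8, 8, 3)  # (frames, H, W, RGB)
```

Because each video corresponds to a point z in the latent space, operations like the interpolation mentioned in the abstract reduce to decoding (1 - a) * z1 + a * z2 for a in [0, 1] through the same meta-network.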
