Paper Title
Video Prediction by Efficient Transformers
Paper Authors
Paper Abstract
Video prediction is a challenging computer vision task with a wide range of applications. In this work, we present a new family of Transformer-based models for video prediction. First, an efficient local spatial-temporal separation attention mechanism is proposed to reduce the complexity of standard Transformers. Then, a fully autoregressive model, a partially autoregressive model, and a non-autoregressive model are developed based on the new efficient Transformer. The partially autoregressive model achieves performance similar to that of the fully autoregressive model but with a faster inference speed. The non-autoregressive model not only achieves an even faster inference speed but also mitigates the quality degradation of its autoregressive counterparts, at the cost of additional parameters and an extra loss function for learning. Given the same attention mechanism, we conduct a comprehensive study comparing the three proposed video prediction variants. Experiments show that the proposed video prediction models are competitive with more complex state-of-the-art convolutional LSTM-based models. The source code is available at https://github.com/XiYe20/VPTR.
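The complexity reduction mentioned in the abstract comes from separating attention along the spatial and temporal axes instead of attending over all tokens of all frames jointly. The sketch below is an illustration of that general idea only, not the paper's actual VPTR architecture: for a video of T frames with N spatial tokens each, joint attention scales as O((T·N)²), while separated attention attends within each frame (O(T·N²)) and then across frames per spatial location (O(N·T²)). All function names and tensor shapes here are assumptions for the illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # scaled dot-product self-attention over the token axis
    d = q.shape[-1]
    scores = q @ k.swapaxes(-1, -2) / np.sqrt(d)
    return softmax(scores) @ v

def separated_attention(x):
    """Spatial-temporal separated self-attention (illustrative sketch).

    x: array of shape (T, N, d) -- T frames, N spatial tokens, d channels.
    Instead of one attention over all T*N tokens, apply:
      1. spatial attention within each frame  -> cost O(T * N^2)
      2. temporal attention per spatial token -> cost O(N * T^2)
    """
    x = attention(x, x, x)            # (T, N, d): each frame attends to itself
    xt = x.swapaxes(0, 1)             # (N, T, d): group by spatial location
    xt = attention(xt, xt, xt)        # each location attends across frames
    return xt.swapaxes(0, 1)          # back to (T, N, d)

T, N, d = 4, 16, 8
x = np.random.default_rng(0).normal(size=(T, N, d))
y = separated_attention(x)
print(y.shape)  # (4, 16, 8)
```

The actual model additionally restricts the spatial attention to local windows (the "local" in the abstract), which reduces the spatial term further; the sketch keeps full per-frame attention for brevity.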