Paper Title

Efficient Movie Scene Detection using State-Space Transformers

Paper Authors

Md Mohaiminul Islam, Mahmudul Hasan, Kishan Shamsundar Athrey, Tony Braskich, Gedas Bertasius

Paper Abstract

The ability to distinguish between different movie scenes is critical for understanding the storyline of a movie. However, accurately detecting movie scenes is often challenging as it requires the ability to reason over very long movie segments. This is in contrast to most existing video recognition models, which are typically designed for short-range video analysis. This work proposes a State-Space Transformer model that can efficiently capture dependencies in long movie videos for accurate movie scene detection. Our model, dubbed TranS4mer, is built using a novel S4A building block, which combines the strengths of structured state-space sequence (S4) and self-attention (A) layers. Given a sequence of frames divided into movie shots (uninterrupted periods where the camera position does not change), the S4A block first applies self-attention to capture short-range intra-shot dependencies. Afterward, the state-space operation in the S4A block is used to aggregate long-range inter-shot cues. The final TranS4mer model, which can be trained end-to-end, is obtained by stacking the S4A blocks one after the other multiple times. Our proposed TranS4mer outperforms all prior methods in three movie scene detection datasets, including MovieNet, BBC, and OVSD, while also being $2\times$ faster and requiring $3\times$ less GPU memory than standard Transformer models. We will release our code and models.
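
The abstract describes the S4A building block as self-attention over the frames within each shot, followed by a state-space operation that aggregates information across shots, with several such blocks stacked to form TranS4mer. Below is a minimal PyTorch sketch of that structure, assuming a simple per-channel diagonal linear recurrence as a stand-in for the full structured S4 kernel; all class names, tensor shapes, and hyperparameters are illustrative assumptions and do not come from the authors' released implementation.

```python
# Minimal sketch of an S4A-style block (assumed structure, not the authors' code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class SimpleStateSpace(nn.Module):
    """Diagonal linear state-space layer: a simplified stand-in for an S4 layer."""

    def __init__(self, dim, state_size=16):
        super().__init__()
        # Per-channel diagonal state matrix, kept stable by forcing decay in (0, 1).
        self.log_a = nn.Parameter(torch.randn(dim, state_size))
        self.B = nn.Parameter(torch.randn(dim, state_size) * 0.1)
        self.C = nn.Parameter(torch.randn(dim, state_size) * 0.1)
        self.D = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        # x: (batch, seq_len, dim); run the linear recurrence along the sequence.
        b, l, d = x.shape
        a_bar = torch.exp(-F.softplus(self.log_a))        # decay factor per step
        h = x.new_zeros(b, d, self.B.shape[1])
        outputs = []
        for t in range(l):
            u = x[:, t, :]                                 # (batch, dim)
            h = a_bar * h + self.B * u.unsqueeze(-1)       # update hidden state
            outputs.append((h * self.C).sum(-1) + self.D * u)
        return torch.stack(outputs, dim=1)


class S4ABlock(nn.Module):
    """Intra-shot self-attention followed by inter-shot state-space aggregation."""

    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.norm_attn = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_ssm = nn.LayerNorm(dim)
        self.ssm = SimpleStateSpace(dim)
        self.norm_mlp = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):
        # x: (batch, num_shots, frames_per_shot, dim)
        b, s, f, d = x.shape
        # 1) Short-range self-attention restricted to frames of the same shot.
        tokens = x.reshape(b * s, f, d)
        h = self.norm_attn(tokens)
        attn_out, _ = self.attn(h, h, h)
        tokens = tokens + attn_out
        # 2) Long-range state-space mixing over the flattened shot sequence.
        seq = tokens.reshape(b, s * f, d)
        seq = seq + self.ssm(self.norm_ssm(seq))
        # 3) Position-wise feed-forward.
        seq = seq + self.mlp(self.norm_mlp(seq))
        return seq.reshape(b, s, f, d)


if __name__ == "__main__":
    frames = torch.randn(2, 8, 4, 64)   # (batch, shots, frames per shot, feature dim)
    model = nn.Sequential(*[S4ABlock(64) for _ in range(3)])  # stack S4A blocks
    print(model(frames).shape)          # torch.Size([2, 8, 4, 64])
```

The sequential recurrence loop here is written for clarity and runs in O(L) steps; the actual S4 layer evaluates the same linear state-space map as a global convolution in parallel, which is consistent with the efficiency gains over standard Transformers reported in the abstract.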
