Paper Title
TALLFormer: Temporal Action Localization with a Long-memory Transformer
Paper Authors
Paper Abstract
Most modern approaches in temporal action localization divide this problem into two parts: (i) short-term feature extraction and (ii) long-range temporal boundary localization. Due to the high GPU memory cost of processing long untrimmed videos, many methods sacrifice the representational power of the short-term feature extractor by either freezing the backbone or using a small spatial video resolution. This issue becomes even worse with recent video transformer models, many of which have quadratic memory complexity. To address these issues, we propose TALLFormer, a memory-efficient and end-to-end trainable Temporal Action Localization Transformer with Long-term memory. Our long-term memory mechanism eliminates the need to process hundreds of redundant video frames during each training iteration, thus significantly reducing GPU memory consumption and training time. These efficiency savings allow us (i) to use a powerful video transformer feature extractor without freezing the backbone or reducing the spatial video resolution, while (ii) also maintaining long-range temporal boundary localization capability. With only RGB frames as input and no external action recognition classifier, TALLFormer outperforms the previous state of the art by a large margin, achieving an average mAP of 59.1% on THUMOS14 and 35.6% on ActivityNet-1.3. The code is publicly available: https://github.com/klauscc/TALLFormer.
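The abstract's key idea is that each training iteration only needs to re-encode a few clips with the trainable backbone, while the remaining clips of the long untrimmed video reuse features cached in a long-term memory. The sketch below is a minimal, hypothetical PyTorch illustration of that kind of mechanism under stated assumptions; the names `LongMemoryBank` and `forward_with_memory`, and the exact read/write policy, are my own and do not reflect the actual TALLFormer implementation in the linked repository.

```python
import torch


class LongMemoryBank:
    """Hypothetical feature cache: one entry per short clip of each training video."""

    def __init__(self):
        self.bank = {}  # video_id -> Tensor of shape (num_clips, feat_dim), kept on CPU

    def init_video(self, video_id, num_clips, feat_dim):
        # Cold start: unseen clips are zero-initialized; in practice the bank could be
        # filled by a warm-up pass over the dataset (assumption).
        if video_id not in self.bank:
            self.bank[video_id] = torch.zeros(num_clips, feat_dim)

    def read(self, video_id, clip_indices):
        return self.bank[video_id][clip_indices]

    def write(self, video_id, clip_indices, feats):
        # Stored features carry no gradients and live on the CPU to save GPU memory.
        self.bank[video_id][clip_indices] = feats.detach().cpu()


def forward_with_memory(backbone, memory, video_id, clips, sampled_idx):
    """Encode a long untrimmed video while back-propagating through only a few clips.

    clips:       (num_clips, C, T, H, W) short clips covering the whole video
    sampled_idx: list of clip indices sent through the trainable backbone this step;
                 all remaining clips reuse their cached features from the memory bank.
    Returns a full-length feature sequence of shape (num_clips, feat_dim).
    """
    num_clips = clips.shape[0]

    # Only the sampled clips are encoded end-to-end, so GPU memory and compute
    # no longer grow with the full video length.
    fresh = backbone(clips[sampled_idx])                  # (len(sampled_idx), feat_dim)

    memory.init_video(video_id, num_clips, fresh.shape[-1])
    cached_idx = [i for i in range(num_clips) if i not in set(sampled_idx)]
    cached = memory.read(video_id, cached_idx)            # no backbone cost, no gradients

    # Assemble the long-range feature sequence consumed by the boundary localizer.
    feats = torch.zeros(num_clips, fresh.shape[-1], device=fresh.device)
    feats[sampled_idx] = fresh
    feats[cached_idx] = cached.to(fresh.device)

    # Refresh the bank with the newly computed features for later iterations.
    memory.write(video_id, sampled_idx, fresh)
    return feats
```

A temporal localization head would then operate on `feats` as if every clip had been re-encoded, which is how the abstract's claim of keeping long-range boundary localization while cutting per-iteration memory and compute can be realized in principle.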