Paper Title

Learning Temporally Invariant and Localizable Features via Data Augmentation for Video Recognition

Paper Authors

Taeoh Kim, Hyeongmin Lee, MyeongAh Cho, Ho Seong Lee, Dong Heon Cho, Sangyoun Lee

Paper Abstract

Deep-learning-based video recognition has shown promising improvements along with the development of large-scale datasets and spatiotemporal network architectures. In image recognition, learning spatially invariant features is a key factor in improving recognition performance and robustness. Data augmentation based on visual inductive priors, such as cropping, flipping, rotating, or photometric jittering, is a representative approach to achieving these features. Recent state-of-the-art recognition solutions have relied on modern data augmentation strategies that exploit a mixture of augmentation operations. In this study, we extend these strategies to the temporal dimension of videos to learn temporally invariant or temporally localizable features that cover temporal perturbations and complex actions in videos. Based on our novel temporal data augmentation algorithms, video recognition performance is improved over spatial-only data augmentation algorithms using only a limited amount of training data, including on the 1st Visual Inductive Priors (VIPriors) challenge for data-efficient action recognition. Furthermore, the learned features are temporally localizable, which cannot be achieved using spatial augmentation algorithms. Our source code is available at https://github.com/taeoh-kim/temporal_data_augmentation.
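
To make the idea of extending mixing-based augmentation to the temporal axis concrete, below is a minimal sketch of a CutMix-style operation applied along time: a contiguous span of frames from one clip replaces the same span in another, and the label is mixed in proportion to the swapped frames. This is an illustration only, assuming NumPy arrays of shape (T, H, W, C); the function name `temporal_cutmix` and its parameters are hypothetical and are not the authors' API, whose actual implementation lives in the linked repository.

```python
import numpy as np

def temporal_cutmix(clip_a, clip_b, label_a, label_b, alpha=1.0, rng=None):
    """Mix two clips of shape (T, H, W, C) along the temporal axis.

    A contiguous span of frames from clip_b is pasted into clip_a,
    and the label is mixed in proportion to the surviving frames.
    `alpha` parameterizes the Beta distribution for the mixing ratio.
    """
    rng = rng or np.random.default_rng()
    t = clip_a.shape[0]

    lam = rng.beta(alpha, alpha)          # fraction of frames kept from clip_a
    span = int(round((1.0 - lam) * t))    # number of frames pasted from clip_b
    start = rng.integers(0, t - span + 1)  # random start of the pasted span

    mixed = clip_a.copy()
    mixed[start:start + span] = clip_b[start:start + span]

    # Recompute lambda from the actual number of replaced frames.
    lam = 1.0 - span / t
    mixed_label = lam * label_a + (1.0 - lam) * label_b
    return mixed, mixed_label

# Example usage with random data (16-frame RGB clips, 10 classes):
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    a = rng.random((16, 112, 112, 3), dtype=np.float32)
    b = rng.random((16, 112, 112, 3), dtype=np.float32)
    ya = np.eye(10, dtype=np.float32)[3]
    yb = np.eye(10, dtype=np.float32)[7]
    clip, label = temporal_cutmix(a, b, ya, yb, rng=rng)
    print(clip.shape, label)
```

A temporal counterpart of random cropping (randomizing which frame span is sampled from each video) would analogously target temporal invariance, whereas frame-level mixing as above gives the network a training signal about when an action occurs, which relates to the temporal localizability discussed in the abstract.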
