Paper Title

Exploiting Spatial Sparsity for Event Cameras with Visual Transformers

Authors

Zuowen Wang, Yuhuang Hu, Shih-Chii Liu

Abstract

Event cameras report local changes of brightness through an asynchronous stream of output events. Events are spatially sparse at pixel locations with little brightness variation. We propose using a visual transformer (ViT) architecture to leverage its ability to process a variable-length input. The input to the ViT consists of events that are accumulated into time bins and spatially separated into non-overlapping sub-regions called patches. A patch is selected when the number of nonzero pixel locations within its sub-region is above a threshold. We show that by fine-tuning a ViT model on the selected active patches, we can reduce the average number of patches fed into the backbone during inference by at least 50%, with only a minor drop (0.34%) in classification accuracy on the N-Caltech101 dataset. This reduction translates into a 51% decrease in Multiply-Accumulate (MAC) operations and a 46% increase in inference speed on a server CPU.
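The patch-selection step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, the patch size, and the threshold value are assumptions for the example; the paper's actual settings may differ.

```python
import numpy as np

def select_active_patches(event_frame, patch_size=16, threshold=8):
    """Split an accumulated event frame into non-overlapping patches and
    keep only those whose count of nonzero pixel locations exceeds
    `threshold`.  Illustrative sketch: patch_size and threshold are
    placeholder values, not the paper's exact hyperparameters."""
    H, W = event_frame.shape
    patches, indices = [], []
    for i in range(0, H - patch_size + 1, patch_size):
        for j in range(0, W - patch_size + 1, patch_size):
            patch = event_frame[i:i + patch_size, j:j + patch_size]
            # An "active" patch has more nonzero pixels than the threshold.
            if np.count_nonzero(patch) > threshold:
                patches.append(patch)
                indices.append((i // patch_size, j // patch_size))
    return np.array(patches), indices
```

Only the selected patches (together with their positional indices, so the ViT can still attach position embeddings) would then be fed to the transformer backbone, which is what yields the reduction in MAC operations.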
