Paper Title
Localizing Visual Sounds the Easy Way
Paper Authors
Paper Abstract
Unsupervised audio-visual source localization aims to localize visible sound sources in a video without relying on ground-truth localization for training. Previous works often seek high audio-visual similarity for likely positive (sounding) regions and low similarity for likely negative regions. However, accurately distinguishing between sounding and non-sounding regions is challenging without manual annotations. In this work, we propose a simple yet effective approach for Easy Visual Sound Localization, namely EZ-VSL, that does not rely on the construction of positive and/or negative regions during training. Instead, we align the audio and visual spaces by seeking audio-visual representations that match in at least one location of the associated image, while not matching other images at any location. We also introduce a novel object-guided localization scheme at inference time for improved precision. Our simple and effective framework achieves state-of-the-art performance on two popular benchmarks, Flickr SoundNet and VGG-Sound Source. In particular, we improve the CIoU on the Flickr SoundNet test set from 76.80% to 83.94%, and on the VGG-Sound Source dataset from 34.60% to 38.85%. The code is available at https://github.com/stoneMo/EZ-VSL.
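As a rough illustration of the training objective described in the abstract, the sketch below implements a multiple-instance contrastive loss in PyTorch: each audio clip should match its paired image at at least one spatial location (taken as a max over locations), while every location of other images in the batch serves as a negative. All function names, tensor shapes, the temperature tau, and the blending weight alpha are illustrative assumptions based only on the abstract, not the authors' implementation; see https://github.com/stoneMo/EZ-VSL for the official code.

import torch
import torch.nn.functional as F

def ezvsl_loss(audio_emb, visual_map, tau=0.07):
    # audio_emb: (B, D) global audio embeddings.
    # visual_map: (B, D, H, W) spatial visual embeddings.
    B = audio_emb.shape[0]
    a = F.normalize(audio_emb, dim=1)              # (B, D)
    v = F.normalize(visual_map.flatten(2), dim=1)  # (B, D, H*W)

    # Similarity of every audio to every location of every image.
    sim = torch.einsum('ad,bdn->abn', a, v) / tau  # (B, B, H*W)

    # Max over locations: an audio-image pair counts as matching if the
    # audio aligns with at least one spatial location of the image.
    logits = sim.max(dim=2).values                 # (B, B)

    # InfoNCE over images in both directions (audio-to-image and
    # image-to-audio); other images are negatives at every location.
    targets = torch.arange(B, device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def object_guided_map(av_sim_map, objectness_map, alpha=0.4):
    # Inference-time sketch of the object-guided scheme: blend the
    # audio-visual similarity map with a generic objectness prior (e.g.
    # derived from a pretrained visual backbone). alpha = 0.4 is an
    # assumed value, not the paper's tuned setting.
    return alpha * av_sim_map + (1 - alpha) * objectness_map

The second function only sketches the object-guided inference step as a convex combination of the audio-visual map with an objectness prior; how that prior is computed is left open here, since the abstract does not specify it.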