Paper Title

AudioScopeV2: Audio-Visual Attention Architectures for Calibrated Open-Domain On-Screen Sound Separation

Authors

Efthymios Tzinis, Scott Wisdom, Tal Remez, John R. Hershey

Abstract

We introduce AudioScopeV2, a state-of-the-art universal audio-visual on-screen sound separation system which is capable of learning to separate sounds and associate them with on-screen objects by looking at in-the-wild videos. We identify several limitations of previous work on audio-visual on-screen sound separation, including the coarse resolution of spatio-temporal attention, poor convergence of the audio separation model, limited variety in training and evaluation data, and failure to account for the trade-off between preservation of on-screen sounds and suppression of off-screen sounds. We provide solutions to all of these issues. Our proposed cross-modal and self-attention network architectures capture audio-visual dependencies at a finer resolution over time, and we also propose efficient separable variants that are capable of scaling to longer videos without sacrificing much performance. We also find that pre-training the separation model only on audio greatly improves results. For training and evaluation, we collected new human annotations of on-screen sounds from a large database of in-the-wild videos (YFCC100M). This new dataset is more diverse and challenging. Finally, we propose a calibration procedure that allows exact tuning of on-screen reconstruction versus off-screen suppression, which greatly simplifies comparing performance between models with different operating points. Overall, our experimental results show marked improvements in on-screen separation performance under much more general conditions than previous methods, with minimal additional computational complexity.
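
The cross-modal attention the abstract describes can be pictured with a short sketch. The following is a minimal, hypothetical PyTorch illustration, not the paper's actual architecture: the class name, embedding size, and head count are all assumptions. The idea shown is that visual frame embeddings act as queries attending over the audio frame sequence, so audio-visual dependencies are captured at the resolution of individual frames rather than over a coarse grid.

```python
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    """Hypothetical sketch of frame-level cross-modal attention.

    Dimensions and layer choices are illustrative assumptions,
    not the architecture from the paper.
    """

    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, visual: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        # visual: (batch, video_frames, dim) -- queries
        # audio:  (batch, audio_frames, dim) -- keys and values
        # Each video-frame embedding attends over the full audio
        # timeline, yielding a fine temporal alignment of the modalities.
        attended, _ = self.attn(query=visual, key=audio, value=audio)
        return self.norm(visual + attended)  # residual connection + norm

# Usage: 16 video frames attending over 100 audio frames.
visual = torch.randn(2, 16, 256)
audio = torch.randn(2, 100, 256)
out = CrossModalAttention()(visual, audio)  # shape: (2, 16, 256)
```

The efficient separable variants mentioned in the abstract would factor attention along the time and space axes separately to reduce cost on longer videos; that factorization is omitted here for brevity.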
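
The calibration idea can likewise be sketched. Below is a minimal, hypothetical illustration (the function name, its inputs, and the target-rejection parameter are assumptions, not the paper's exact procedure): on held-out data, sweep the threshold on each separated source's on-screen probability until off-screen sources are rejected at a chosen rate, which fixes the operating point on the on-screen-preservation versus off-screen-suppression trade-off.

```python
import numpy as np

def calibrate_threshold(probs, labels, target_offscreen_rejection=0.9):
    """Hypothetical sketch of operating-point calibration.

    probs:  on-screen probability per separated source (validation set)
    labels: 1 if the source is truly on-screen, else 0
    Picks the smallest threshold tau whose off-screen rejection rate
    meets the target; lowering the target keeps more on-screen sound
    at the cost of letting more off-screen sound through.
    """
    probs, labels = np.asarray(probs), np.asarray(labels)
    offscreen = probs[labels == 0]
    for tau in np.unique(probs):  # candidate thresholds, ascending
        if np.mean(offscreen >= tau) <= 1.0 - target_offscreen_rejection:
            return float(tau)
    return 1.0  # reject everything if the target is unreachable

# A separated source estimate is kept in the on-screen mix iff its
# on-screen probability is >= tau.
tau = calibrate_threshold([0.2, 0.9, 0.4, 0.8], [0, 1, 0, 1], 0.9)
```

Fixing a calibrated operating point like this is what makes models with different built-in biases directly comparable, as the abstract notes.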
