Learning Sound Localization Better From Semantically Similar Samples

Paper title

Learning Sound Localization Better From Semantically Similar Samples

Authors

Arda Senocak, Hyeonggon Ryu, Junsik Kim, In So Kweon

Abstract

The objective of this work is to localize sound sources in visual scenes. Existing audio-visual works employ contrastive learning by treating corresponding audio-visual pairs from the same source as positives and randomly mismatched pairs as negatives. However, these negative pairs may contain semantically matched audio-visual information. Such semantically correlated pairs, "hard positives", are thus mistakenly grouped as negatives. Our key contribution is showing that hard positives can give response maps similar to those of the corresponding pairs. Our approach incorporates these hard positives by adding their response maps directly into a contrastive learning objective. We demonstrate the effectiveness of our approach on the VGG-SS and SoundNet-Flickr test sets, showing favorable performance compared to state-of-the-art methods.
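The idea of folding hard positives into the contrastive objective can be illustrated with a small NumPy sketch. This is not the authors' formulation: the mean-pooled cosine response map, the temperature `tau`, the boolean `hard_pos` mask, and all function names are assumptions made for illustration, and the abstract does not say how hard positives are mined. The sketch only shows the general pattern of moving semantically matched pairs from the negative term to the positive term of an InfoNCE-style loss.

```python
import numpy as np

def response_map(img_feat, aud_feat):
    """Cosine similarity between one audio vector and every image location.

    img_feat: (H, W, C) visual feature map; aud_feat: (C,) audio embedding.
    Returns an (H, W) response map. (Illustrative, not the paper's exact map.)
    """
    img = img_feat / (np.linalg.norm(img_feat, axis=-1, keepdims=True) + 1e-8)
    aud = aud_feat / (np.linalg.norm(aud_feat) + 1e-8)
    return img @ aud

def contrastive_loss(img_feats, aud_feats, hard_pos, tau=0.07):
    """InfoNCE-style loss where hard positives count toward the positive term.

    img_feats: (N, H, W, C); aud_feats: (N, C).
    hard_pos: (N, N) boolean mask (hypothetical input); hard_pos[i, j] is True
    when audio j is treated as a hard positive for image i.
    """
    n = len(img_feats)
    # Pool each response map to a scalar score (mean pooling is an assumption).
    scores = np.array([[response_map(img_feats[i], aud_feats[j]).mean()
                        for j in range(n)] for i in range(n)])
    weights = np.exp(scores / tau)
    pos_mask = np.eye(n, dtype=bool) | hard_pos   # true pairs plus hard positives
    pos = (weights * pos_mask).sum(axis=1)
    neg = (weights * ~pos_mask).sum(axis=1)
    return float(np.mean(-np.log(pos / (pos + neg))))
```

With an all-False `hard_pos` mask this reduces to an ordinary contrastive loss over matched and mismatched pairs; setting entries to True stops the loss from pushing those pairs apart.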
