iQuery: Instruments as Queries for Audio-Visual Sound Separation

Paper Title

iQuery: Instruments as Queries for Audio-Visual Sound Separation

Authors

Chen, Jiaben; Zhang, Renrui; Lian, Dongze; Yang, Jiaqi; Zeng, Ziyao; Shi, Jianbo

Abstract

Current audio-visual separation methods share a standard architecture design in which an audio encoder-decoder network is fused with visual encoding features at the encoder bottleneck. This design confounds the learning of multi-modal feature encoding with robust sound decoding for audio separation. To generalize to a new instrument, one must fine-tune the entire visual and audio network for all musical instruments. We reformulate the visual-sound separation task and propose Instrument as Query (iQuery) with a flexible query expansion mechanism. Our approach ensures cross-modal consistency and cross-instrument disentanglement. We utilize "visually named" queries to initiate the learning of audio queries and use cross-modal attention to remove potential sound source interference at the estimated waveforms. To generalize to a new instrument or event class, drawing inspiration from the text-prompt design, we insert an additional query as an audio prompt while freezing the attention mechanism. Experimental results on three benchmarks demonstrate that our iQuery improves audio-visual sound source separation performance.
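
The query expansion idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a single cross-attention step with no learned projections and no self-attention among queries, and the feature vectors, query vectors, and instrument names are hypothetical toy values. Under those assumptions, each query attends to the audio features independently, so appending a query for a new instrument leaves the outputs of the existing queries unchanged.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross_attend(queries, features):
    """Scaled dot-product cross-attention; the features serve as keys and values.

    Assumption: no learned projections and no query self-attention, so each
    query's output depends only on that query and the shared features.
    """
    dim = len(features[0])
    scale = math.sqrt(dim)
    outputs = []
    for q in queries:
        weights = softmax([dot(q, f) / scale for f in features])
        outputs.append(
            [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]
        )
    return outputs

# Hypothetical toy audio features (three time-frequency tokens, two channels).
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Hypothetical per-instrument queries; the values are illustrative only.
queries = {"guitar": [2.0, 0.0], "violin": [0.0, 2.0]}
before = cross_attend(list(queries.values()), features)

# Query expansion: append one query for a new class, leave everything else fixed.
queries["flute"] = [1.0, 1.0]
after = cross_attend(list(queries.values()), features)

# Outputs for the original instruments are identical before and after expansion.
unchanged = after[:2] == before
```

In a full transformer decoder, queries also interact through self-attention, so this independence would hold only approximately, or only for the cross-attention sub-layer.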
