Exploring Train and Test-Time Augmentations for Audio-Language Learning

Paper title

Exploring Train and Test-Time Augmentations for Audio-Language Learning

Authors

Eungbeom Kim, Jinhee Kim, Yoori Oh, Kyungsu Kim, Minju Park, Jaeheon Sim, Jinwoo Lee, Kyogu Lee

Abstract

In this paper, we aim to reveal the impact of data augmentation in audio-language multi-modal learning, which remains unexplored despite its importance. We explore various augmentation methods at both train time and test time and find that proper data augmentation can lead to substantial improvements. Specifically, our proposed audio-language paired augmentation, PairMix, which is the first multi-modal audio-language augmentation method, outperforms the baselines on both automated audio captioning and audio-text retrieval. To take full advantage of data augmentation, we also present multi-level test-time augmentation (Multi-TTA). Combining the two proposed methods with uni-modal augmentations, we achieve 47.5 SPIDEr on audio captioning, an 18.2% relative increase over the baseline. The proposed methods also improve performance on audio-text retrieval.
