Paper Title
CLAP: Learning Audio Concepts From Natural Language Supervision
Paper Authors
Paper Abstract
Mainstream audio analytics models are trained under the paradigm of one class label to many recordings, focusing on one task. Learning under such restricted supervision limits the flexibility of models because they require labeled audio for training and can only predict the predefined categories. Instead, we propose to learn audio concepts from natural language supervision. We call our approach Contrastive Language-Audio Pretraining (CLAP), which learns to connect language and audio by using two encoders and contrastive learning to bring audio and text descriptions into a joint multimodal space. We trained CLAP with 128k audio-text pairs and evaluated it on 16 downstream tasks across 8 domains, such as Sound Event Classification, Music tasks, and Speech-related tasks. Although CLAP was trained with significantly fewer pairs than similar computer vision models, it establishes SoTA for Zero-Shot performance. Additionally, we evaluated CLAP in a supervised learning setup and achieved SoTA in 5 tasks. Hence, CLAP's Zero-Shot capability removes the need for training with class labels, enables flexible class prediction at inference time, and generalizes to multiple downstream tasks.
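The abstract's core idea — two encoders pulling matched audio-text pairs together in a joint space via contrastive learning — can be illustrated with a minimal sketch of a CLIP-style symmetric contrastive objective. This is not the paper's implementation; the embeddings below are random stand-ins for the outputs of the hypothetical audio and text encoders, and the `temperature` value is illustrative.

```python
import numpy as np

def contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of matched audio-text pairs.

    Matched pairs sit on the diagonal of the similarity matrix; the loss
    is cross-entropy toward that diagonal in both directions.
    """
    # L2-normalize each modality so similarities are cosine similarities
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    # Pairwise similarity logits, scaled by the temperature
    logits = (a @ t.T) / temperature
    n = logits.shape[0]

    def xent(lg):
        # Cross-entropy with the diagonal as the target, numerically stable
        lg = lg - lg.max(axis=1, keepdims=True)
        logprob = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logprob[np.arange(n), np.arange(n)].mean()

    # Average of audio-to-text and text-to-audio directions
    return 0.5 * (xent(logits) + xent(logits.T))

rng = np.random.default_rng(0)
audio = rng.normal(size=(4, 512))  # stand-in for audio-encoder outputs
text = rng.normal(size=(4, 512))   # stand-in for text-encoder outputs
loss = contrastive_loss(audio, text)
```

At inference time, the same similarity computation enables the flexible zero-shot classification the abstract mentions: embed each candidate class name (or a prompt containing it) with the text encoder and pick the class whose embedding is closest to the audio embedding.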