Paper Title


data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language

Paper Authors

Alexei Baevski, Wei-Ning Hsu, Qiantong Xu, Arun Babu, Jiatao Gu, Michael Auli

Paper Abstract


While the general idea of self-supervised learning is identical across modalities, the actual algorithms and objectives differ widely because they were developed with a single modality in mind. To get us closer to general self-supervised learning, we present data2vec, a framework that uses the same learning method for either speech, NLP or computer vision. The core idea is to predict latent representations of the full input data based on a masked view of the input in a self-distillation setup using a standard Transformer architecture. Instead of predicting modality-specific targets such as words, visual tokens or units of human speech which are local in nature, data2vec predicts contextualized latent representations that contain information from the entire input. Experiments on the major benchmarks of speech recognition, image classification, and natural language understanding demonstrate a new state of the art or competitive performance to predominant approaches.
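The self-distillation setup described in the abstract can be sketched in a few lines: a teacher (an exponential moving average of the student) encodes the full input to produce contextualized targets, while the student encodes a masked view and regresses onto the targets at the masked positions. The sketch below is a minimal illustration under simplifying assumptions: a single linear layer stands in for the Transformer, zeros stand in for the learned mask embedding, and the real method averages the top-K Transformer layers to form targets; the function names are hypothetical.

```python
# Minimal sketch of the data2vec-style self-distillation objective.
# Assumptions (not the authors' implementation): a tiny tanh-linear
# "encoder" replaces the Transformer, and masked positions are zeroed.
import numpy as np

rng = np.random.default_rng(0)

def encode(x, w):
    """Stand-in encoder: one linear layer with tanh nonlinearity."""
    return np.tanh(x @ w)

def ema_update(teacher_w, student_w, tau=0.999):
    """Teacher weights track the student via an exponential moving average."""
    return tau * teacher_w + (1.0 - tau) * student_w

def data2vec_step(x, student_w, teacher_w, mask_prob=0.5):
    # Teacher sees the FULL input and produces contextualized targets.
    targets = encode(x, teacher_w)
    # Student sees a MASKED view of the same input.
    mask = rng.random(x.shape[0]) < mask_prob
    x_masked = x.copy()
    x_masked[mask] = 0.0
    preds = encode(x_masked, student_w)
    # Regression loss on the masked positions only.
    if not mask.any():
        return 0.0
    return float(np.mean((preds[mask] - targets[mask]) ** 2))

# Toy usage: 16 timesteps of 8-dim features (any modality).
x = rng.normal(size=(16, 8))
student_w = rng.normal(size=(8, 8)) * 0.1
teacher_w = student_w.copy()  # teacher initialized from the student
loss = data2vec_step(x, student_w, teacher_w)
teacher_w = ema_update(teacher_w, student_w)
```

Because the targets are full-context encodings rather than local tokens, the same loss applies unchanged to speech, image patches, or text embeddings.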
