Paper Title
OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework
Paper Authors
Paper Abstract
In this work, we pursue a unified paradigm for multimodal pretraining to break the scaffolds of complex task/modality-specific customization. We propose OFA, a Task-Agnostic and Modality-Agnostic framework that supports Task Comprehensiveness. OFA unifies a diverse set of cross-modal and unimodal tasks, including image generation, visual grounding, image captioning, image classification, language modeling, etc., in a simple sequence-to-sequence learning framework. OFA follows instruction-based learning in both the pretraining and finetuning stages, requiring no extra task-specific layers for downstream tasks. In comparison with recent state-of-the-art vision & language models that rely on extremely large cross-modal datasets, OFA is pretrained on only 20M publicly available image-text pairs. Despite its simplicity and relatively small-scale training data, OFA achieves new state-of-the-art results on a series of cross-modal tasks while attaining highly competitive performance on unimodal tasks. Our further analysis indicates that OFA can also effectively transfer to unseen tasks and unseen domains. Our code and models are publicly available at https://github.com/OFA-Sys/OFA.
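To make the unification concrete, here is a minimal sketch of how heterogeneous tasks can be cast as instruction-conditioned sequence-to-sequence pairs, in the spirit of the abstract. The instruction templates, the `Seq2SeqExample` container, and the `make_example` helper are illustrative assumptions, not OFA's actual prompts or code; in the real model, image inputs are embedded and region/image outputs are expressed as discrete tokens in the same target vocabulary.

```python
# A minimal, hypothetical sketch (not OFA's actual code) of how diverse tasks
# can share one (source, target) sequence-to-sequence interface via instructions.
from dataclasses import dataclass

@dataclass
class Seq2SeqExample:
    source: str  # instruction plus any text input; image features would be prepended in practice
    target: str  # output sequence: text, quantized region coordinates, or image codes

def make_example(task: str, **kwargs) -> Seq2SeqExample:
    """Map a task name and its fields to a unified (source, target) pair.

    The templates below are assumed for illustration only.
    """
    templates = {
        # cross-modal tasks
        "caption":   ("What does the image describe?",
                      kwargs.get("caption", "")),
        "vqa":       (kwargs.get("question", ""),
                      kwargs.get("answer", "")),
        "grounding": (f"Which region does the text \"{kwargs.get('text', '')}\" describe?",
                      kwargs.get("region", "")),  # region as location tokens
        # unimodal task
        "lm":        (kwargs.get("prefix", ""),
                      kwargs.get("continuation", "")),
    }
    source, target = templates[task]
    return Seq2SeqExample(source=source, target=target)

if __name__ == "__main__":
    ex = make_example("caption", caption="two dogs playing on the beach")
    print(ex.source, "->", ex.target)
```

Because every task reduces to the same source/target interface, a single shared encoder-decoder can be pretrained and finetuned across all of them, which is why no task-specific layers are needed downstream.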