Paper Title
Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks
Paper Authors
Paper Abstract
We propose Unified-IO, a model that performs a large variety of AI tasks spanning classical computer vision tasks, including pose estimation, object detection, depth estimation and image generation, vision-and-language tasks such as region captioning and referring expression, to natural language processing tasks such as question answering and paraphrasing. Developing a single unified model for such a large variety of tasks poses unique challenges due to the heterogeneous inputs and outputs pertaining to each task, including RGB images, per-pixel maps, binary masks, bounding boxes, and language. We achieve this unification by homogenizing every supported input and output into a sequence of discrete vocabulary tokens. This common representation across all tasks allows us to train a single transformer-based architecture, jointly on over 90 diverse datasets in the vision and language fields. Unified-IO is the first model capable of performing all 7 tasks on the GRIT benchmark and produces strong results across 16 diverse benchmarks like NYUv2-Depth, ImageNet, VQA2.0, OK-VQA, Swig, VizWizGround, BoolQ, and SciTail, with no task-specific fine-tuning. Code and demos for Unified-IO are available at: https://unified-io.allenai.org.
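The abstract's central idea is homogenizing heterogeneous outputs (boxes, masks, pixels, text) into one discrete vocabulary so a single sequence model can handle every task. The sketch below illustrates that idea for a bounding box quantized into location tokens; the bin count, token naming, and function are illustrative assumptions, not the paper's exact scheme.

```python
# Minimal sketch of the token-unification idea: quantize continuous
# bounding-box coordinates into discrete vocabulary tokens, so a box
# becomes a short token sequence like any text output.
# num_bins=1000 and the "<loc_i>" naming are assumptions for illustration.

def box_to_tokens(box, image_size, num_bins=1000):
    """Map (x1, y1, x2, y2) pixel coords to discrete location tokens."""
    w, h = image_size
    tokens = []
    for coord, extent in zip(box, (w, h, w, h)):
        # Normalize to [0, 1), then bucket into one of num_bins bins.
        bin_idx = min(int(coord / extent * num_bins), num_bins - 1)
        tokens.append(f"<loc_{bin_idx}>")
    return tokens

print(box_to_tokens((48, 32, 320, 240), image_size=(640, 480)))
```

With all modalities expressed as token sequences like this, one transformer decoder can emit boxes, depth maps, or sentences from the same output head, which is what enables joint training across the 90+ datasets mentioned above.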