Paper Title

UniT3D: A Unified Transformer for 3D Dense Captioning and Visual Grounding

Paper Authors

Dave Zhenyu Chen, Ronghang Hu, Xinlei Chen, Matthias Nießner, Angel X. Chang

Paper Abstract

Performing 3D dense captioning and visual grounding requires a common, shared understanding of the underlying multimodal relationships. However, despite previous attempts at connecting these two related tasks with highly task-specific neural modules, it remains understudied how to explicitly model their shared nature so that they can be learned simultaneously. In this work, we propose UniT3D, a simple yet effective, fully unified transformer-based architecture for jointly solving 3D visual grounding and dense captioning. UniT3D enables learning a strong multimodal representation across the two tasks through a supervised joint pre-training scheme with bidirectional and seq-to-seq objectives. With a generic architecture design, UniT3D allows expanding the pre-training scope to a wider variety of training sources, such as data synthesized from 2D prior knowledge, to benefit 3D vision-language tasks. Extensive experiments and analysis demonstrate that UniT3D obtains significant gains on 3D dense captioning and visual grounding.
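
Since the abstract describes the architecture only at a high level, below is a minimal PyTorch sketch of the idea it names: one transformer encoder shared across both tasks, with a grounding head driven by a bidirectional objective and a captioning head driven by a seq-to-seq (causal) objective. All module names, shapes, and hyperparameters here are illustrative assumptions and do not reproduce UniT3D's actual design.

```python
# Minimal sketch of a unified transformer for 3D visual grounding and dense
# captioning. Everything here (names, dimensions, token layout) is an
# assumption for illustration, not the paper's implementation.
import torch
import torch.nn as nn

class UnifiedTransformerSketch(nn.Module):
    def __init__(self, d_model=256, vocab_size=3000, num_layers=4, nhead=8):
        super().__init__()
        # One encoder shared by both tasks, run over concatenated
        # object-proposal tokens and word tokens.
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.word_emb = nn.Embedding(vocab_size, d_model)
        # Grounding head: score each object proposal against the query.
        self.grounding_head = nn.Linear(d_model, 1)
        # Captioning head: next-word distribution over the vocabulary.
        self.caption_head = nn.Linear(d_model, vocab_size)

    def forward(self, object_feats, word_ids, causal_caption_mask=False):
        # object_feats: (B, n_obj, d_model) pre-extracted 3D proposal features
        # word_ids:     (B, n_word) ids of the query or partial caption
        words = self.word_emb(word_ids)
        tokens = torch.cat([object_feats, words], dim=1)
        n_obj, n_word = object_feats.size(1), words.size(1)
        attn_mask = None
        if causal_caption_mask:
            # Seq-to-seq objective: a word may not attend to future words;
            # object tokens stay fully visible (prefix-LM style mask).
            total = n_obj + n_word
            attn_mask = torch.zeros(total, total, dtype=torch.bool,
                                    device=tokens.device)
            attn_mask[n_obj:, n_obj:] = torch.triu(
                torch.ones(n_word, n_word, dtype=torch.bool,
                           device=tokens.device), diagonal=1)
        hidden = self.encoder(tokens, mask=attn_mask)
        grounding_logits = self.grounding_head(hidden[:, :n_obj]).squeeze(-1)
        caption_logits = self.caption_head(hidden[:, n_obj:])
        return grounding_logits, caption_logits

# Toy usage: both tasks share all encoder weights and differ only in which
# head is read out and whether the causal mask is applied.
model = UnifiedTransformerSketch()
obj = torch.randn(2, 16, 256)            # 16 object proposals per scene
txt = torch.randint(0, 3000, (2, 12))    # 12 query / caption tokens
g_logits, c_logits = model(obj, txt, causal_caption_mask=True)
print(g_logits.shape, c_logits.shape)    # (2, 16) and (2, 12, 3000)
```

In this reading of the abstract, the "fully unified" design comes down to one set of encoder weights serving both tasks, with only lightweight task heads and the choice of attention mask distinguishing grounding from captioning.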
