Paper Title
MORE: Multi-Order RElation Mining for Dense Captioning in 3D Scenes
Paper Authors
Paper Abstract
3D dense captioning is a recently proposed task in which point clouds contain more geometric information than their 2D counterparts. However, it is also more challenging due to the higher complexity and wider variety of inter-object relations in point clouds. Existing methods treat such relations merely as by-products of object feature learning in graphs, without encoding them explicitly, which leads to sub-optimal results. In this paper, aiming to improve 3D dense captioning by capturing and utilizing the complex relations in a 3D scene, we propose MORE, a Multi-Order RElation mining model, to support generating more descriptive and comprehensive captions. Technically, MORE encodes object relations in a progressive manner, since complex relations can be deduced from a limited number of basic ones. We first devise a novel Spatial Layout Graph Convolution (SLGC), which semantically encodes several first-order relations as edges of a graph constructed over 3D object proposals. Next, from the resulting graph, we extract multiple triplets that encapsulate first-order relations as basic units, and construct several Object-centric Triplet Attention Graphs (OTAG) to infer multi-order relations for every target object. The updated node features from OTAG are aggregated and fed into the caption decoder to provide abundant relational cues, so that captions covering diverse relations with context objects can be generated. Extensive experiments on the Scan2Cap dataset demonstrate the effectiveness of MORE and its components, and show that it outperforms the current state-of-the-art method. Our code is available at https://github.com/SxJyJay/MORE.
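The abstract only outlines the architecture at a high level. As a rough illustration of the idea, the sketch below shows in PyTorch how discrete first-order spatial relations might be derived from proposal box centers and used to condition message passing over the proposal graph, with a multi-head attention step standing in for OTAG's object-centric triplet aggregation. Every name here (`first_order_relation`, `SpatialLayoutGraphConv`, `ObjectCentricTripletAttention`), the six-way relation taxonomy, and all tensor shapes are assumptions made for illustration, not the authors' implementation; see the linked repository for the real code.

```python
# Minimal, illustrative sketch of an SLGC/OTAG-style pipeline.
# All module names, shapes, and the relation taxonomy are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_RELATIONS = 6  # hypothetical first-order relations: left/right/front/behind/above/below


def first_order_relation(centers: torch.Tensor) -> torch.Tensor:
    """Label each ordered proposal pair (i, j) with a discrete spatial relation
    based on the dominant axis of their center displacement (a stand-in for
    SLGC's semantic first-order edges). Self-edges are labeled arbitrarily and
    would be masked out in practice."""
    # centers: (N, 3) box centers of 3D object proposals
    diff = centers[None, :, :] - centers[:, None, :]               # (N, N, 3)
    axis = diff.abs().argmax(dim=-1)                               # dominant axis per pair
    sign = (diff.gather(-1, axis.unsqueeze(-1)).squeeze(-1) > 0).long()
    return axis * 2 + sign                                         # relation id in [0, 5]


class SpatialLayoutGraphConv(nn.Module):
    """Graph convolution whose messages are conditioned on the discrete
    first-order relation of each edge (one projection per relation type)."""
    def __init__(self, dim: int):
        super().__init__()
        self.rel_proj = nn.ModuleList(nn.Linear(dim, dim) for _ in range(NUM_RELATIONS))

    def forward(self, feats: torch.Tensor, rel_ids: torch.Tensor) -> torch.Tensor:
        # feats: (N, D) proposal features; rel_ids: (N, N) relation labels
        n = feats.size(0)
        msgs = torch.zeros(n, n, feats.size(1), device=feats.device)
        for r, proj in enumerate(self.rel_proj):
            mask = (rel_ids == r).unsqueeze(-1).float()            # (N, N, 1)
            msgs = msgs + mask * proj(feats).unsqueeze(0)          # message from neighbor j to i
        return F.relu(feats + msgs.mean(dim=1))                    # residual + neighbor aggregation


class ObjectCentricTripletAttention(nn.Module):
    """For a target object, attend over pooled (subject, relation, object)
    triplet features, so multi-order relations emerge from chains of
    first-order ones."""
    def __init__(self, dim: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, target: torch.Tensor, triplet_feats: torch.Tensor) -> torch.Tensor:
        # target: (1, D); triplet_feats: (T, D)
        out, _ = self.attn(target.unsqueeze(0), triplet_feats.unsqueeze(0),
                           triplet_feats.unsqueeze(0))
        return out.squeeze(0)                                      # (1, D) relational cue for the decoder


if __name__ == "__main__":
    N, D = 8, 256                          # 8 proposals; D must be divisible by num_heads
    centers, feats = torch.randn(N, 3), torch.randn(N, D)
    rel_ids = first_order_relation(centers)
    ctx_feats = SpatialLayoutGraphConv(D)(feats, rel_ids)
    triplets = torch.randn(12, D)          # hypothetical pooled triplet features
    cue = ObjectCentricTripletAttention(D)(ctx_feats[0:1], triplets)
    print(cue.shape)                       # torch.Size([1, 256])
```

The point of conditioning each message on a discrete relation label, rather than on raw coordinates, is that the resulting node features carry explicit relational semantics that a caption decoder can verbalize ("the chair to the left of the table"), which is the gap the abstract attributes to prior graph-based methods.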