Paper Title
COBRA: Contrastive Bi-Modal Representation Algorithm
Paper Authors
Paper Abstract
There is a wide range of applications that involve multi-modal data, such as cross-modal retrieval, visual question answering, and image captioning. Such applications depend primarily on aligned distributions of the different constituent modalities. Existing approaches generate latent embeddings for each modality in a joint fashion by representing them in a common manifold. However, these joint embedding spaces fail to sufficiently reduce the modality gap, which affects performance on downstream tasks. We hypothesize that these embeddings retain the intra-class relationships but are unable to preserve the inter-class dynamics. In this paper, we present a novel framework, COBRA, that aims to train two modalities (image and text) in a joint fashion inspired by the Contrastive Predictive Coding (CPC) and Noise Contrastive Estimation (NCE) paradigms, which preserve both inter- and intra-class relationships. We empirically show that this framework reduces the modality gap significantly and generates a robust and task-agnostic joint embedding space. We outperform existing work on four diverse downstream tasks spanning seven benchmark cross-modal datasets.
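To make the NCE-style joint training concrete, the sketch below shows a minimal contrastive bi-modal setup: image and text features are projected into a shared embedding space, and an InfoNCE-style loss treats matched image-text pairs as positives and all other pairs in the batch as negatives. This is an illustrative assumption of the general paradigm, not the authors' COBRA implementation; the encoder architectures, feature dimensions, temperature, and the full COBRA objective differ from what is shown here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointEmbedder(nn.Module):
    """Projects pre-extracted image and text features into a shared joint space.
    Dimensions here are illustrative assumptions."""
    def __init__(self, img_dim=2048, txt_dim=768, joint_dim=256):
        super().__init__()
        self.img_proj = nn.Sequential(
            nn.Linear(img_dim, joint_dim), nn.ReLU(), nn.Linear(joint_dim, joint_dim))
        self.txt_proj = nn.Sequential(
            nn.Linear(txt_dim, joint_dim), nn.ReLU(), nn.Linear(joint_dim, joint_dim))

    def forward(self, img_feats, txt_feats):
        # L2-normalize so dot products become cosine similarities
        z_img = F.normalize(self.img_proj(img_feats), dim=-1)
        z_txt = F.normalize(self.txt_proj(txt_feats), dim=-1)
        return z_img, z_txt

def info_nce_loss(z_img, z_txt, temperature=0.07):
    """NCE/InfoNCE-style contrastive loss: the i-th image and i-th text are a
    positive pair; every other pairing in the batch serves as a negative."""
    logits = z_img @ z_txt.t() / temperature                  # (B, B) similarity matrix
    targets = torch.arange(z_img.size(0), device=z_img.device)
    # Symmetric objective over image-to-text and text-to-image directions
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# Example training step on a batch of (assumed) pre-extracted features
model = JointEmbedder()
img_feats = torch.randn(32, 2048)   # e.g. CNN image features
txt_feats = torch.randn(32, 768)    # e.g. transformer text features
z_img, z_txt = model(img_feats, txt_feats)
loss = info_nce_loss(z_img, z_txt)
loss.backward()
```

Pulling matched pairs together while pushing apart all other pairings in the batch is what encourages the shared space to preserve inter-class structure in addition to intra-class similarity, which is the property the abstract argues plain joint embeddings lack.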