Paper Title

ProtoCLIP: Prototypical Contrastive Language Image Pretraining

Paper Authors

Delong Chen, Zhao Wu, Fan Liu, Zaiquan Yang, Huaxi Huang, Ying Tan, Erjin Zhou

Paper Abstract

Contrastive Language Image Pretraining (CLIP) has received widespread attention, since its learned representations can be transferred well to various downstream tasks. During the training process of the CLIP model, the InfoNCE objective aligns positive image-text pairs and separates negative ones. We show an underlying representation grouping effect during this process: the InfoNCE objective indirectly groups semantically similar representations together via randomly emerging within-modal anchors. Based on this understanding, in this paper, Prototypical Contrastive Language Image Pretraining (ProtoCLIP) is introduced to enhance such grouping by boosting its efficiency and increasing its robustness against the modality gap. Specifically, ProtoCLIP sets up prototype-level discrimination between image and text spaces, which efficiently transfers higher-level structural knowledge. Further, Prototypical Back Translation (PBT) is proposed to decouple representation grouping from representation alignment, resulting in effective learning of meaningful representations under a large modality gap. PBT also enables us to introduce additional external teachers with richer prior language knowledge. ProtoCLIP is trained with an online episodic training strategy, which allows it to be scaled up to unlimited amounts of data. We train ProtoCLIP on Conceptual Captions and achieve a +5.81% ImageNet linear probing improvement and a +2.01% ImageNet zero-shot classification improvement. On the larger YFCC-15M dataset, ProtoCLIP matches the performance of CLIP with 33% of the training time. Code is available at https://github.com/megvii-research/protoclip.
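
The abstract refers to the standard CLIP training objective: a symmetric InfoNCE loss that pulls matched image-text pairs together while treating the other pairings in a batch as negatives. Below is a minimal PyTorch sketch of that loss for illustration only; the function name, temperature value, and embedding sizes are assumptions and are not taken from the ProtoCLIP codebase.

# Minimal sketch of the symmetric InfoNCE objective used in CLIP-style training.
# Names and hyperparameters here are illustrative assumptions, not ProtoCLIP code.
import torch
import torch.nn.functional as F

def clip_infonce_loss(image_emb: torch.Tensor,
                      text_emb: torch.Tensor,
                      temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of paired image/text embeddings."""
    # L2-normalize so dot products become cosine similarities.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Pairwise similarity matrix; diagonal entries are the positive pairs.
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Image-to-text and text-to-image cross-entropy, averaged.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2

if __name__ == "__main__":
    # Random embeddings standing in for image/text encoder outputs.
    img = torch.randn(8, 512)
    txt = torch.randn(8, 512)
    print(clip_infonce_loss(img, txt).item())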
