Paper Title

Design of the topology for contrastive visual-textual alignment

Author

Sun, Zhun

Abstract

Cosine similarity is the common choice for measuring the distance between feature representations in contrastive visual-textual alignment learning. Empirically, however, a learnable softmax temperature parameter is required when learning on large-scale noisy training data. In this work, we first discuss the role of the softmax temperature in terms of the topological properties of the embedding space. We argue that the softmax temperature is the key mechanism for contrastive learning on noisy training data. It acts as a scaling factor of the distance range (e.g. [-1, 1] for the cosine similarity), and its learned value indicates the level of noise in the training data. Then, we propose an alternative design of the topology for the embedding alignment. We make use of multiple class tokens in the transformer architecture, then map the feature representations onto an oblique manifold endowed with the negative inner product as the distance function. With this configuration, we substantially improve the zero-shot classification performance of baseline CLIP models pre-trained on large-scale datasets, by an average of 6.1%.
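
The mapping described in the abstract can be illustrated with a minimal NumPy sketch. It is not the author's implementation. It assumes that the oblique manifold is handled as a product of m unit spheres, so that a d-dimensional embedding is split into m sub-vectors (standing in for the multiple class tokens) and each sub-vector is L2-normalized. The function names `to_oblique`, `neg_inner_product`, and `contrastive_loss`, the choice of m, and the fixed temperature are all hypothetical. Under these assumptions the negative inner product ranges over [-m, m] instead of the [-1, 1] range of cosine similarity.

```python
import numpy as np


def to_oblique(x, m):
    """Split each d-dim row into m sub-vectors and L2-normalize each one.

    Assumption of this sketch: the oblique manifold is treated as a
    product of m unit spheres, one per sub-vector.
    """
    x = np.asarray(x, dtype=np.float64)
    batch, d = x.shape
    if d % m != 0:
        raise ValueError("embedding dimension must be divisible by m")
    parts = x.reshape(batch, m, d // m)
    norms = np.linalg.norm(parts, axis=-1, keepdims=True)
    return parts / np.maximum(norms, 1e-12)  # guard against zero vectors


def neg_inner_product(a, b):
    """Pairwise negative inner product between two batches of points.

    Each point consists of m unit vectors, so every value lies in [-m, m].
    """
    a_flat = a.reshape(a.shape[0], -1)
    b_flat = b.reshape(b.shape[0], -1)
    return -(a_flat @ b_flat.T)


def contrastive_loss(img, txt, m, temperature=1.0):
    """Symmetric InfoNCE-style loss over matched image/text rows.

    The temperature is a fixed hypothetical value here; it rescales the
    distance range before the softmax.
    """
    zi, zt = to_oblique(img, m), to_oblique(txt, m)
    logits = -neg_inner_product(zi, zt) / temperature  # similarity = -distance

    def cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_prob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))  # matched pairs sit on the diagonal

    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    img = rng.normal(size=(4, 32))  # hypothetical image embeddings
    txt = rng.normal(size=(4, 32))  # hypothetical text embeddings
    print(contrastive_loss(img, txt, m=8))
```

With m = 1 the sketch reduces to ordinary cosine similarity on a single sphere. Larger m widens the distance range without a temperature, which is the scaling role the abstract attributes to the softmax temperature.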
