Paper Title
CALIP: Zero-Shot Enhancement of CLIP with Parameter-free Attention
Paper Authors
Paper Abstract
Contrastive Language-Image Pre-training (CLIP) has been shown to learn visual representations with great transferability, achieving promising accuracy for zero-shot classification. To further improve its downstream performance, existing works propose additional learnable modules on top of CLIP and fine-tune them on few-shot training sets. However, the resulting extra training cost and data requirements severely hinder the efficiency of model deployment and knowledge transfer. In this paper, we introduce a free-lunch enhancement method, CALIP, to boost CLIP's zero-shot performance via a parameter-free attention module. Specifically, we guide visual and textual representations to interact with each other and explore cross-modal informative features via attention. As the pre-training has largely reduced the embedding distance between the two modalities, we discard all learnable parameters in the attention and bidirectionally update the multi-modal features, making the whole process parameter-free and training-free. In this way, the images are blended with textual-aware signals and the text representations become visually guided for better adaptive zero-shot alignment. We evaluate CALIP on various benchmarks across 14 datasets for both 2D image and 3D point cloud few-shot classification, showing consistent zero-shot performance improvement over CLIP. On top of that, we further insert a small number of linear layers into CALIP's attention module and verify its robustness under few-shot settings, which also achieves leading performance compared to existing methods. These extensive experiments demonstrate the superiority of our approach for efficient enhancement of CLIP.
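The abstract describes a bidirectional, parameter-free attention step between CLIP's visual and textual features. Below is a minimal sketch of that idea, assuming L2-normalized patch-level visual features and per-class text embeddings; the temperature and the residual blending weights (alpha, beta) are illustrative assumptions and not the paper's exact formulation.

```python
# Sketch of a parameter-free cross-modal attention update (no learnable weights),
# following the mechanism described in the abstract. Shapes and hyperparameters
# (temperature, alpha, beta) are hypothetical choices for illustration.
import torch
import torch.nn.functional as F


def parameter_free_attention(f_v, f_t, temperature=2.0, alpha=1.0, beta=1.0):
    """Bidirectionally update visual and textual features without training.

    f_v: (N, C) L2-normalized spatial visual features (e.g., N = H*W patches)
    f_t: (K, C) L2-normalized text features, one per class prompt
    """
    # Cross-modal affinity: every spatial feature scored against every class embedding.
    attn = f_v @ f_t.t() / temperature           # (N, K)
    a_v = attn.softmax(dim=-1)                   # per patch, attention over classes
    a_t = attn.softmax(dim=0).t()                # per class, attention over patches (K, N)

    # Residual, training-free updates: images absorb textual-aware signals,
    # text embeddings become visually guided.
    f_v_new = F.normalize(f_v + alpha * a_v @ f_t, dim=-1)
    f_t_new = F.normalize(f_t + beta * a_t @ f_v, dim=-1)
    return f_v_new, f_t_new


# Toy usage: 49 patch features and 10 class prompts with 512-dim CLIP-like embeddings.
f_v = F.normalize(torch.randn(49, 512), dim=-1)
f_t = F.normalize(torch.randn(10, 512), dim=-1)
f_v_new, f_t_new = parameter_free_attention(f_v, f_t)

# Zero-shot logits from the updated features (global average pool over patches).
logits = 100.0 * F.normalize(f_v_new.mean(dim=0, keepdim=True), dim=-1) @ f_t_new.t()
print(logits.shape)  # torch.Size([1, 10])
```

Because no parameters are introduced, this update can be applied at inference time only, which is what makes the enhancement training-free.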