Paper Title

SuS-X: Training-Free Name-Only Transfer of Vision-Language Models

Paper Authors

Vishaal Udandarao, Ankush Gupta, Samuel Albanie

Paper Abstract

Contrastive Language-Image Pre-training (CLIP) has emerged as a simple yet effective way to train large-scale vision-language models. CLIP demonstrates impressive zero-shot classification and retrieval on diverse downstream tasks. However, to leverage its full potential, fine-tuning still appears to be necessary. Fine-tuning the entire CLIP model can be resource-intensive and unstable. Moreover, recent methods that aim to circumvent this need for fine-tuning still require access to images from the target distribution. In this paper, we pursue a different approach and explore the regime of training-free "name-only transfer" in which the only knowledge we possess about the downstream task comprises the names of downstream target categories. We propose a novel method, SuS-X, consisting of two key building blocks -- SuS and TIP-X, that requires neither intensive fine-tuning nor costly labelled data. SuS-X achieves state-of-the-art zero-shot classification results on 19 benchmark datasets. We further show the utility of TIP-X in the training-free few-shot setting, where we again achieve state-of-the-art results over strong training-free baselines. Code is available at https://github.com/vishaal27/SuS-X.
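
As background, the paper's training-free "name-only" setting builds on CLIP's standard zero-shot classification pipeline: the downstream class names are turned into text prompts, embedded alongside the query image, and the prediction is the class whose prompt embedding is most similar to the image embedding. The sketch below illustrates this baseline using OpenAI's `clip` package (https://github.com/openai/CLIP); the class names, prompt template, and image path are illustrative placeholders, not from the paper.

```python
# Minimal sketch of CLIP zero-shot classification -- the training-free
# baseline that SuS-X builds on. Assumes OpenAI's `clip` package is
# installed; class names and image path below are placeholders.
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# "Name-only" knowledge: all we have are the downstream category names.
class_names = ["cat", "dog", "car"]  # placeholder categories
prompts = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)

image = preprocess(Image.open("example.jpg").convert("RGB")).unsqueeze(0).to(device)

with torch.no_grad():
    image_feat = model.encode_image(image)
    text_feats = model.encode_text(prompts)
    # Cosine similarity between the image and each class prompt.
    image_feat /= image_feat.norm(dim=-1, keepdim=True)
    text_feats /= text_feats.norm(dim=-1, keepdim=True)
    logits = 100.0 * image_feat @ text_feats.T

pred = class_names[logits.argmax(dim=-1).item()]
print(f"Predicted class: {pred}")
```

SuS-X improves on this baseline without any fine-tuning by curating a support set (SuS) from the class names alone and classifying with TIP-X, as described in the abstract above.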
