Paper Title


Mutual Information Divergence: A Unified Metric for Multimodal Generative Models

Paper Authors

Jin-Hwa Kim, Yunji Kim, Jiyoung Lee, Kang Min Yoo, Sang-Woo Lee

Paper Abstract


Text-to-image generation and image captioning have recently emerged as a new experimental paradigm for assessing machine intelligence. These models predict continuous quantities through sampling techniques during generation, which makes evaluation complicated and the marginal distributions intractable to obtain. Following the recent trend of multimodal generative evaluation exploiting a vision-and-language pre-trained model, we propose the negative Gaussian cross-mutual information over CLIP features as a unified metric, coined Mutual Information Divergence (MID). To validate it, we extensively compare it with competing metrics using carefully generated or human-annotated judgments on text-to-image generation and image captioning tasks. The proposed MID significantly outperforms competing methods, showing consistency across benchmarks, sample parsimony, and robustness to the choice of the exploited CLIP model. We look forward to seeing the underrepresented implications of Gaussian cross-mutual information in multimodal representation learning and future work based on this novel proposition.
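As background for the metric the abstract names, the mutual information of two jointly Gaussian feature sets has a closed form in terms of covariance log-determinants. Below is a minimal sketch of that Gaussian estimate from paired feature matrices (e.g., CLIP image and text embeddings); it is an illustrative assumption-laden simplification, not the paper's full MID computation, and the function name and `eps` regularizer are our own choices.

```python
import numpy as np

def gaussian_mutual_information(x, y, eps=1e-6):
    """Estimate I(X; Y) assuming (X, Y) are jointly Gaussian.

    x, y: (n_samples, d) arrays of paired features
    (e.g., CLIP image and text embeddings).
    Uses I(X; Y) = 0.5 * (log det Sx + log det Sy - log det Sxy),
    where Sxy is the covariance of the concatenated features.
    """
    xy = np.concatenate([x, y], axis=1)
    # Small diagonal regularizer keeps the covariances non-singular.
    sx = np.cov(x, rowvar=False) + eps * np.eye(x.shape[1])
    sy = np.cov(y, rowvar=False) + eps * np.eye(y.shape[1])
    sxy = np.cov(xy, rowvar=False) + eps * np.eye(xy.shape[1])
    # slogdet is numerically stabler than det for high-dim covariances.
    _, ldx = np.linalg.slogdet(sx)
    _, ldy = np.linalg.slogdet(sy)
    _, ldxy = np.linalg.slogdet(sxy)
    return 0.5 * (ldx + ldy - ldxy)
```

Correlated feature pairs yield a clearly positive estimate, while independent features score near zero, which is the behavior a mutual-information-based alignment metric relies on.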
