Paper Title

Denoising Large-Scale Image Captioning from Alt-text Data using Content Selection Models

Paper Authors

Khyathi Raghavi Chandu, Piyush Sharma, Soravit Changpinyo, Ashish Thapliyal, Radu Soricut

Paper Abstract

Training large-scale image captioning (IC) models demands access to a rich and diverse set of training examples, gathered from the wild, often from noisy alt-text data. However, recent modeling approaches to IC often fall short in terms of performance in this case, because they assume a clean annotated dataset (as opposed to the noisier alt-text-based annotations), and employ an end-to-end generation approach, which often lacks both controllability and interpretability. We address these problems by breaking down the task into two simpler, more controllable tasks: skeleton prediction and skeleton-based caption generation. Specifically, we show that selecting content words as skeletons helps in generating improved and denoised captions when leveraging rich yet noisy alt-text-based uncurated datasets. We also show that the predicted English skeletons can be further cross-lingually leveraged to generate non-English captions, and present experimental results covering caption generation in French, Italian, German, Spanish, and Hindi. We also show that skeleton-based prediction allows for better control of certain caption properties, such as length, content, and gender expression, providing a handle to perform human-in-the-loop semi-automatic corrections.
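The two-stage decomposition the abstract describes can be illustrated with a minimal sketch. Note the stopword list, the rule-based skeleton selector, and the template generator below are illustrative stand-ins invented for this example; the paper uses learned models for both stages.

```python
# Hedged sketch of the two-stage pipeline: (1) skeleton prediction selects
# content words from a noisy alt-text string; (2) skeleton-based generation
# realizes a caption conditioned on that skeleton. All specifics here
# (stopword list, template) are hypothetical, not the paper's models.

STOPWORDS = {"a", "an", "the", "of", "in", "on", "at", "for",
             "stock", "photo", "image", "free", "download", "jpg"}

def predict_skeleton(alt_text: str) -> list[str]:
    """Stage 1: keep likely content words, dropping noise and function words."""
    tokens = alt_text.lower().replace(",", " ").split()
    return [t for t in tokens if t.isalpha() and t not in STOPWORDS]

def generate_caption(skeleton: list[str]) -> str:
    """Stage 2: trivial template realization of the skeleton
    (stand-in for the paper's learned, skeleton-conditioned generator)."""
    if not skeleton:
        return "an image"
    return "a picture of " + " ".join(skeleton)

noisy_alt = "Free stock photo download: dog playing in the park, image jpg"
skeleton = predict_skeleton(noisy_alt)   # ['dog', 'playing', 'park']
caption = generate_caption(skeleton)
```

Because the skeleton is an explicit intermediate, it offers the controllability handle the abstract mentions: a human can edit the word list (add, drop, or swap content words) before the generation stage runs.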
