Paper Title

Denoising Large-Scale Image Captioning from Alt-text Data using Content Selection Models

Paper Authors

Khyathi Raghavi Chandu, Piyush Sharma, Soravit Changpinyo, Ashish Thapliyal, Radu Soricut

Paper Abstract

Training large-scale image captioning (IC) models demands access to a rich and diverse set of training examples, gathered from the wild, often from noisy alt-text data. However, recent modeling approaches to IC often fall short in terms of performance in this case, because they assume a clean annotated dataset (as opposed to the noisier alt-text-based annotations), and employ an end-to-end generation approach, which often lacks both controllability and interpretability. We address these problems by breaking down the task into two simpler, more controllable tasks: skeleton prediction and skeleton-based caption generation. Specifically, we show that selecting content words as skeletons helps in generating improved and denoised captions when leveraging rich yet noisy alt-text-based uncurated datasets. We also show that the predicted English skeletons can be further cross-lingually leveraged to generate non-English captions, and present experimental results covering caption generation in French, Italian, German, Spanish, and Hindi. We also show that skeleton-based prediction allows for better control of certain caption properties, such as length, content, and gender expression, providing a handle to perform human-in-the-loop semi-automatic corrections.
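The two-stage decomposition the abstract describes can be illustrated with a minimal sketch. Note the stopword list, the rule-based skeleton selector, and the template generator below are illustrative stand-ins invented for this example; the paper uses learned models for both stages.

```python
# Hedged sketch of the two-stage pipeline: (1) skeleton prediction selects
# content words from a noisy alt-text string; (2) skeleton-based generation
# realizes a caption conditioned on that skeleton. All specifics here
# (stopword list, template) are hypothetical, not the paper's models.

STOPWORDS = {"a", "an", "the", "of", "in", "on", "at", "for",
             "stock", "photo", "image", "free", "download", "jpg"}

def predict_skeleton(alt_text: str) -> list[str]:
    """Stage 1: keep likely content words, dropping noise and function words."""
    tokens = alt_text.lower().replace(",", " ").split()
    return [t for t in tokens if t.isalpha() and t not in STOPWORDS]

def generate_caption(skeleton: list[str]) -> str:
    """Stage 2: trivial template realization of the skeleton
    (stand-in for the paper's learned, skeleton-conditioned generator)."""
    if not skeleton:
        return "an image"
    return "a picture of " + " ".join(skeleton)

noisy_alt = "Free stock photo download: dog playing in the park, image jpg"
skeleton = predict_skeleton(noisy_alt)   # ['dog', 'playing', 'park']
caption = generate_caption(skeleton)
```

Because the skeleton is an explicit intermediate, it offers the controllability handle the abstract mentions: a human can edit the word list (add, drop, or swap content words) before the generation stage runs.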
