Paper Title
Switching to Discriminative Image Captioning by Relieving a Bottleneck of Reinforcement Learning
Paper Authors
Paper Abstract
Discriminativeness is a desirable feature of image captions: captions should describe the characteristic details of input images. However, recent high-performing captioning models, which are trained with reinforcement learning (RL), tend to generate overly generic captions despite their high performance on various other criteria. First, we investigate the cause of the unexpectedly low discriminativeness and show that RL has a deeply rooted side effect of limiting the output words to high-frequency words. The limited vocabulary is a severe bottleneck for discriminativeness, as it is difficult for a model to describe details beyond its vocabulary. Then, based on this identification of the bottleneck, we drastically recast discriminative image captioning as a much simpler task of encouraging low-frequency word generation. Hinted by long-tail classification and debiasing methods, we propose methods that easily switch off-the-shelf RL models to discriminativeness-aware models with only a single-epoch fine-tuning on a part of the parameters. Extensive experiments demonstrate that our methods significantly enhance the discriminativeness of off-the-shelf RL models and even outperform previous discriminativeness-aware methods with much smaller computational costs. Detailed analysis and human evaluation also verify that our methods boost discriminativeness without sacrificing the overall quality of captions.
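To make the abstract's idea of debiasing-style fine-tuning concrete, the sketch below shows one generic way to encourage low-frequency word generation: a logit-adjustment term, borrowed from long-tail classification, applied while fine-tuning only the decoder's output projection. This is a minimal illustration under assumptions of my own; the function `logit_adjusted_loss`, the hyperparameter `tau`, and the choice of which parameters to fine-tune are hypothetical and not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def logit_adjusted_loss(logits, targets, word_freq, tau=1.0):
    """Cross-entropy with a frequency-based logit adjustment (illustrative).

    logits:    (batch, vocab) raw scores from the caption decoder
    targets:   (batch,) ground-truth word indices
    word_freq: (vocab,) empirical word frequencies in the training captions
    tau:       adjustment strength (hypothetical hyperparameter)
    """
    # Adding the log-frequency prior to the logits during training keeps the
    # model from having to encode the frequency bias itself; dropping the
    # adjustment at inference then relatively boosts low-frequency words.
    prior = torch.log(word_freq + 1e-12)
    return F.cross_entropy(logits + tau * prior, targets)

# Toy usage: fine-tune only the output projection of a (placeholder) decoder.
vocab_size, hidden = 10000, 512
output_layer = torch.nn.Linear(hidden, vocab_size)   # stands in for part of a pretrained RL model
optimizer = torch.optim.Adam(output_layer.parameters(), lr=1e-4)

word_freq = torch.rand(vocab_size)                   # placeholder word frequencies
hidden_states = torch.randn(32, hidden)              # placeholder frozen decoder states
targets = torch.randint(0, vocab_size, (32,))

loss = logit_adjusted_loss(output_layer(hidden_states), targets, word_freq)
loss.backward()
optimizer.step()
```

The design intuition matches the abstract: because only a small subset of parameters is updated and the loss merely reweights word probabilities by frequency, such a switch can be applied to an off-the-shelf RL captioning model with a single epoch of fine-tuning.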