Paper Title
Variational Transformer: A Framework Beyond the Trade-off between Accuracy and Diversity for Image Captioning
Paper Authors
Paper Abstract
Accuracy and diversity are two essential measurable properties of natural and semantically correct captions. Many efforts have been made to enhance one of them, but the other degrades because of the trade-off between the two. In this work, we show that the lower accuracy standard derived from human annotations (leave-one-out) is not appropriate for machine-generated captions. To improve diversity while keeping accuracy solid, we use a novel Variational Transformer framework. By introducing the "Invisible Information Prior" and the "Auto-selectable GMM", we instruct the encoder to learn precise language information and object relations in different scenes, which ensures accuracy. By introducing the "Range-Median Reward" baseline, we retain more diverse candidates with higher rewards during RL-based training, which ensures diversity. Experiments show that our method improves accuracy (CIDEr) and diversity (self-CIDEr) simultaneously, by up to 1.1 and 4.8 percent, respectively. Our method also achieves the semantic retrieval performance closest to that of human annotations, with an R@1 (i2t) of 50.3 versus 50.6 for human annotations.
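The R@1 (i2t) figure quoted in the abstract is a recall-at-K retrieval score: the share of image queries whose correct caption is ranked first. The abstract does not describe the retrieval model or the evaluation protocol, so the following is only a minimal sketch of the generic metric. The function name `recall_at_k`, the toy similarity matrix, and the one-caption-per-image pairing are illustrative assumptions, not the authors' setup.

```python
from typing import Sequence


def recall_at_k(similarity: Sequence[Sequence[float]], k: int = 1) -> float:
    """Fraction of image queries whose matching caption ranks in the top k.

    similarity[i][j] is the score between query image i and candidate caption j.
    Assumption (for illustration only): caption i is the single correct match
    for image i.
    """
    if not similarity:
        raise ValueError("similarity matrix is empty")
    hits = 0
    for i, row in enumerate(similarity):
        # Zero-based rank of the correct caption: how many candidates score
        # strictly higher. Ties are therefore resolved in favor of the match.
        rank = sum(1 for score in row if score > row[i])
        if rank < k:
            hits += 1
    return hits / len(similarity)


# Toy scores: rows are images, columns are captions (hypothetical values).
scores = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.1, 0.2, 0.7],
]
print(recall_at_k(scores, k=1))
print(recall_at_k(scores, k=2))
```

In this toy case two of the three images retrieve their own caption first, and the remaining one finds it in second place. Values such as 50.3 are presumably this fraction expressed as a percentage.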