Paper title
Imagining Grounded Conceptual Representations from Perceptual Information in Situated Guessing Games
Paper authors
Paper abstract
In visual guessing games, a Guesser has to identify a target object in a scene by asking questions to an Oracle. An effective strategy for the players is to learn conceptual representations of objects that are both discriminative and expressive enough to ask questions and guess correctly. However, as shown by Suglia et al. (2020), existing models fail to learn truly multi-modal representations, relying instead on gold category labels for objects in the scene both at training and inference time. This provides an unnatural performance advantage when categories at inference time match those at training time, and it causes models to fail in more realistic "zero-shot" scenarios where out-of-domain object categories are involved. To overcome this issue, we introduce a novel "imagination" module based on Regularized Auto-Encoders that learns context-aware and category-aware latent embeddings without relying on category labels at inference time. Our imagination module outperforms state-of-the-art competitors by 8.26% in gameplay accuracy in the CompGuessWhat?! zero-shot scenario (Suglia et al., 2020), and it improves the Oracle and Guesser accuracy by 2.08% and 12.86%, respectively, in the GuessWhat?! benchmark when no gold categories are available at inference time. The imagination module also boosts reasoning about object properties and attributes.