Paper Title
AVLEN: Audio-Visual-Language Embodied Navigation in 3D Environments
Authors
Abstract
Recent years have seen embodied visual navigation advance in two distinct directions: (i) equipping the AI agent to follow natural language instructions, and (ii) making the navigable world multimodal, e.g., audio-visual navigation. However, the real world is not only multimodal but also often complex, so despite these advances, agents still need to understand the uncertainty in their actions and seek instructions to navigate. To this end, we present AVLEN, an interactive agent for Audio-Visual-Language Embodied Navigation. As in audio-visual navigation tasks, the goal of our embodied agent is to localize an audio event by navigating the 3D visual world; however, the agent may also seek help from a human (oracle), who provides assistance in free-form natural language. To realize these abilities, AVLEN uses a multimodal hierarchical reinforcement learning backbone that learns: (a) high-level policies that choose between navigating by audio cues and querying the oracle, and (b) lower-level policies that select navigation actions based on the agent's audio-visual and language inputs. The policies are trained by rewarding success on the navigation task while minimizing the number of queries to the oracle. To empirically evaluate AVLEN, we present experiments on the SoundSpaces framework for semantic audio-visual navigation tasks. Our results show that equipping the agent to ask for help leads to a clear improvement in performance, especially in challenging cases, e.g., when the sound was not heard during training or when distractor sounds are present.
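The two-level decision structure and the query-penalized reward described in the abstract can be illustrated with a small sketch. This is not the paper's implementation: the uncertainty threshold, the query budget, the reward constants, and every function name below are hypothetical, and the hand-written rule merely stands in for a policy that AVLEN learns with reinforcement learning.

```python
# Illustrative sketch only. All names, thresholds, and reward constants
# are assumptions made for this example; none are taken from the paper.

NAVIGATE_BY_AUDIO = 0  # high-level option: follow audio-visual cues
QUERY_ORACLE = 1       # high-level option: ask the oracle for an instruction


def high_level_policy(uncertainty: float, queries_left: int,
                      threshold: float = 0.5) -> int:
    """Toy stand-in for the learned high-level policy.

    Queries the oracle only when the agent is uncertain and still has
    query budget left; otherwise navigates using audio cues.
    """
    if uncertainty > threshold and queries_left > 0:
        return QUERY_ORACLE
    return NAVIGATE_BY_AUDIO


def episode_reward(success: bool, num_queries: int,
                   success_bonus: float = 10.0,
                   query_penalty: float = 1.0) -> float:
    """Reward navigation success while charging a cost per oracle query."""
    bonus = success_bonus if success else 0.0
    return bonus - query_penalty * num_queries
```

Under these assumed constants, an episode that succeeds after two queries scores lower than one that succeeds with none, which is the kind of pressure that would push an agent to ask for help only when it is likely to pay off.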