Paper Title
Addressing Optimism Bias in Sequence Modeling for Reinforcement Learning
Paper Authors
Paper Abstract
Impressive results in natural language processing (NLP) based on the Transformer neural network architecture have inspired researchers to explore viewing offline reinforcement learning (RL) as a generic sequence modeling problem. Recent works based on this paradigm have achieved state-of-the-art results in several of the mostly deterministic offline Atari and D4RL benchmarks. However, because these methods jointly model the states and actions as a single sequencing problem, they struggle to disentangle the effects of the policy and world dynamics on the return. Thus, in adversarial or stochastic environments, these methods lead to overly optimistic behavior that can be dangerous in safety-critical systems like autonomous driving. In this work, we propose a method that addresses this optimism bias by explicitly disentangling the policy and world models, which allows us at test time to search for policies that are robust to multiple possible futures in the environment. We demonstrate our method's superior performance on a variety of autonomous driving tasks in simulation.
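The test-time search described above can be illustrated with a minimal sketch. Note that this is an illustration under assumptions, not the paper's actual implementation: `policy_model`, `world_model`, and `reward_fn` are hypothetical interfaces, and the search simply scores each candidate action sequence by its worst-case return across several sampled futures from a stochastic world model.

```python
# Minimal sketch of a robust test-time search over multiple possible futures.
# `policy_model`, `world_model`, and `reward_fn` are hypothetical placeholders.
import numpy as np

def robust_plan(state, policy_model, world_model, reward_fn,
                num_candidates=8, num_futures=8, horizon=10):
    """Return the candidate action sequence with the best worst-case return
    across sampled futures (a pessimistic choice that counteracts optimism bias)."""
    best_actions, best_worst_case = None, -np.inf
    for _ in range(num_candidates):
        # Propose a candidate open-loop action sequence from the (disentangled) policy model.
        actions = policy_model.sample_actions(state, horizon)
        returns = []
        for _ in range(num_futures):
            # Roll out the same actions under a different sampled future of the world model.
            s, total = state, 0.0
            for a in actions:
                s = world_model.sample_next_state(s, a)  # one stochastic dynamics sample
                total += reward_fn(s, a)
            returns.append(total)
        worst_case = min(returns)  # score each candidate by its worst sampled future
        if worst_case > best_worst_case:
            best_worst_case, best_actions = worst_case, actions
    return best_actions
```

Scoring by the worst case (rather than the mean or best case) over sampled futures is what makes the selected behavior robust to adversarial or stochastic dynamics; a percentile or mean could be substituted for a less conservative trade-off.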