Paper Title

Ecological Reinforcement Learning

Authors

John D. Co-Reyes, Suvansh Sanjeev, Glen Berseth, Abhishek Gupta, Sergey Levine

Abstract


Much of the current work on reinforcement learning studies episodic settings, where the agent is reset between trials to an initial state distribution, often with well-shaped reward functions. Non-episodic settings, where the agent must learn through continuous interaction with the world without resets, and where it receives only delayed and sparse reward signals, are substantially more difficult, but arguably more realistic, considering that real-world environments do not present the learner with a convenient "reset mechanism" or easy reward shaping. In this paper, instead of studying algorithmic improvements that can address such non-episodic and sparse-reward settings, we study the kinds of environment properties that can make learning under such conditions easier. Understanding how properties of the environment impact the performance of reinforcement learning agents can help us structure our tasks in ways that make learning tractable. We first discuss what we term "environment shaping" -- modifications to the environment that provide an alternative to reward shaping, and may be easier to implement. We then discuss an even simpler property that we refer to as "dynamism," which describes the degree to which the environment changes independently of the agent's actions and can be measured by environment transition entropy. Surprisingly, we find that even this property can substantially alleviate the challenges associated with non-episodic RL in sparse-reward settings. We provide an empirical evaluation on a set of new tasks focused on non-episodic learning with sparse rewards. Through this study, we hope to shift the focus of the community towards analyzing how properties of the environment can affect learning and the ultimate type of behavior that is learned via RL.
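The abstract describes "dynamism" as measurable via environment transition entropy. As a rough illustration of that idea (a minimal sketch, not the paper's actual implementation -- the function name, discrete-state assumption, and toy data are all hypothetical), one can estimate the average next-state entropy H(s' | s, a) from sampled transitions:

```python
import math
from collections import Counter, defaultdict

def transition_entropy(transitions):
    """Estimate the average next-state entropy H(s' | s, a) in bits from
    sampled (state, action, next_state) tuples with discrete states.
    Higher values indicate a more "dynamic" environment, whose state
    changes more unpredictably regardless of what the agent does."""
    counts = defaultdict(Counter)
    for s, a, s_next in transitions:
        counts[(s, a)][s_next] += 1
    entropies = []
    for next_state_counts in counts.values():
        total = sum(next_state_counts.values())
        h = -sum((c / total) * math.log2(c / total)
                 for c in next_state_counts.values())
        entropies.append(h)
    return sum(entropies) / len(entropies)

# A static environment: the same (state, action) pair always yields the
# same next state, so the transition entropy is zero.
static = [(0, 0, 0), (0, 0, 0), (0, 1, 0), (0, 1, 0)]

# A dynamic environment: the same (state, action) pair leads to different
# next states (e.g. resources respawning at random), raising the entropy.
dynamic = [(0, 0, 1), (0, 0, 2), (0, 0, 1), (0, 0, 2)]

print(transition_entropy(static))   # 0.0
print(transition_entropy(dynamic))  # 1.0
```

Under this empirical measure, an environment that evolves on its own (food appearing, weather changing) scores higher than one that sits still until the agent acts.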
