Paper Title


Learning Adaptive Exploration Strategies in Dynamic Environments Through Informed Policy Regularization

Authors

Pierre-Alexandre Kamienny, Matteo Pirotta, Alessandro Lazaric, Thibault Lavril, Nicolas Usunier, Ludovic Denoyer

Abstract


We study the problem of learning exploration-exploitation strategies that effectively adapt to dynamic environments, where the task may change over time. While RNN-based policies could in principle represent such strategies, in practice their training time is prohibitive and the learning process often converges to poor solutions. In this paper, we consider the case where the agent has access to a description of the task (e.g., a task id or task parameters) at training time, but not at test time. We propose a novel algorithm that regularizes the training of an RNN-based policy using informed policies trained to maximize the reward in each task. This dramatically reduces the sample complexity of training RNN-based policies, without losing their representational power. As a result, our method learns exploration strategies that efficiently balance between gathering information about the unknown and changing task and maximizing the reward over time. We test the performance of our algorithm in a variety of environments where tasks may vary within each episode.
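The core idea of the abstract, regularizing an RNN-based policy toward per-task informed policies during training, can be illustrated with a minimal sketch. The loss form below (a policy-gradient term plus a KL penalty pulling the RNN policy toward the informed policy's action distribution, weighted by a hypothetical coefficient `beta`) is an assumption for illustration, not the paper's exact objective; all function and variable names are invented here.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the last axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kl(p, q, eps=1e-8):
    # KL(p || q) per state, summed over the action dimension.
    return np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)

def regularized_loss(rnn_logits, informed_probs, advantages, actions, beta=0.1):
    """Hypothetical training objective: a REINFORCE-style policy-gradient
    term plus beta * KL(informed || rnn), which pulls the RNN-based policy
    toward the task-specific informed policy without removing its capacity
    to represent adaptive exploration strategies."""
    probs = softmax(rnn_logits)
    logp_a = np.log(probs[np.arange(len(actions)), actions] + 1e-8)
    pg_loss = -np.mean(advantages * logp_a)       # standard policy-gradient term
    kl_loss = np.mean(kl(informed_probs, probs))  # regularization toward informed policy
    return pg_loss + beta * kl_loss
```

In this sketch, `informed_probs` would come from a policy trained with access to the task description (e.g., a task id), which is available at training time but not at test time; at test time only the RNN policy is used.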
