Paper Title


Actively Learning Costly Reward Functions for Reinforcement Learning

Authors

André Eberhard, Houssam Metni, Georg Fahland, Alexander Stroh, Pascal Friederich

Abstract

Transfer of recent advances in deep reinforcement learning to real-world applications is hindered by high data demands and thus low efficiency and scalability. Through independent improvements of components such as replay buffers or more stable learning algorithms, and through massively distributed systems, training time could be reduced from several days to several hours for standard benchmark tasks. However, while rewards in simulated environments are well-defined and easy to compute, reward evaluation becomes the bottleneck in many real-world environments, e.g., in molecular optimization tasks, where computationally demanding simulations or even experiments are required to evaluate states and to quantify rewards. Therefore, training might become prohibitively expensive without an extensive amount of computational resources and time. We propose to alleviate this problem by replacing costly ground-truth rewards with rewards modeled by neural networks, counteracting non-stationarity of state and reward distributions during training with an active learning component. We demonstrate that using our proposed ACRL method (Actively learning Costly rewards for Reinforcement Learning), it is possible to train agents in complex real-world environments orders of magnitude faster. By enabling the application of reinforcement learning methods to new domains, we show that we can find interesting and non-trivial solutions to real-world optimization problems in chemistry, materials science and engineering.
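
The abstract outlines the core ACRL loop: a neural-network surrogate replaces the costly ground-truth reward, and an active learning step queries the true reward only where the surrogate is uncertain, so the model keeps up with the agent's shifting state distribution. The sketch below is a minimal illustration of that idea under stated assumptions, not the paper's implementation: the toy `expensive_reward`, the ensemble-disagreement acquisition, the MLP architecture, and all hyperparameters are placeholders chosen for the example.

```python
import torch
import torch.nn as nn

# --- Hypothetical stand-ins (not from the paper's code) ---------------------
def expensive_reward(state: torch.Tensor) -> torch.Tensor:
    """Placeholder for a costly ground-truth reward, e.g. a long simulation."""
    return torch.sin(state.sum(dim=-1))  # toy function for the sketch

class RewardNet(nn.Module):
    """Small MLP that predicts the reward of a state vector."""
    def __init__(self, state_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, 1),
        )
    def forward(self, s: torch.Tensor) -> torch.Tensor:
        return self.net(s).squeeze(-1)

STATE_DIM, ENSEMBLE_SIZE = 8, 4
models = [RewardNet(STATE_DIM) for _ in range(ENSEMBLE_SIZE)]
opts = [torch.optim.Adam(m.parameters(), lr=1e-3) for m in models]

# Labeled pool: states whose true reward has already been paid for.
labeled_s = torch.randn(32, STATE_DIM)
labeled_r = expensive_reward(labeled_s)

def fit_surrogates(epochs: int = 50) -> None:
    """Refit each ensemble member on the current labeled pool."""
    for model, opt in zip(models, opts):
        for _ in range(epochs):
            opt.zero_grad()
            loss = nn.functional.mse_loss(model(labeled_s), labeled_r)
            loss.backward()
            opt.step()

def surrogate_reward(s: torch.Tensor):
    """Cheap reward estimate: ensemble mean, with std as uncertainty proxy."""
    with torch.no_grad():
        preds = torch.stack([m(s) for m in models])
    return preds.mean(dim=0), preds.std(dim=0)

fit_surrogates()
for iteration in range(10):
    # 1) RL phase (stubbed): the agent visits states and is trained on the
    #    cheap surrogate reward r_hat instead of the expensive ground truth.
    visited = torch.randn(256, STATE_DIM)  # stand-in for agent rollouts
    r_hat, uncertainty = surrogate_reward(visited)

    # 2) Active learning phase: pay for the expensive reward only on the
    #    states where the ensemble disagrees most, then refit the surrogate
    #    so it tracks the agent's non-stationary state distribution.
    query_idx = uncertainty.topk(8).indices
    new_s = visited[query_idx]
    labeled_s = torch.cat([labeled_s, new_s])
    labeled_r = torch.cat([labeled_r, expensive_reward(new_s)])
    fit_surrogates()
```

In a real setting, the rollout stub would be replaced by states actually visited by the RL agent, and `r_hat` would feed the policy update. Ensemble disagreement is one common acquisition criterion; the paper's exact active learning strategy may differ.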
