Paper Title

Safe Policy Improvement Approaches and their Limitations

Authors

Philipp Scholl, Felix Dietrich, Clemens Otte, Steffen Udluft

Abstract

Safe Policy Improvement (SPI) is an important technique for offline reinforcement learning in safety-critical applications, as it improves the behavior policy with high probability. We classify various SPI approaches from the literature into two groups, based on how they utilize the uncertainty of state-action pairs. Focusing on the Soft-SPIBB (Safe Policy Improvement with Soft Baseline Bootstrapping) algorithms, we show that their claim of being provably safe does not hold. Based on this finding, we develop adaptations, the Adv-Soft-SPIBB algorithms, and show that they are provably safe. A heuristic adaptation, Lower-Approx-Soft-SPIBB, yields the best performance among all SPIBB algorithms in extensive experiments on two benchmarks. We also examine the safety guarantees of the provably safe algorithms and show that huge amounts of data are necessary for the safety bounds to become useful in practice.
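To make the Soft-SPIBB idea concrete: these methods improve a policy per state subject to a soft constraint of the form sum_a e(s,a)·|π(a|s) − π_b(a|s)| ≤ ε, where e(s,a) is an uncertainty estimate for the state-action pair and π_b is the behavior (baseline) policy, so that the new policy may only deviate from the baseline where the data makes the deviation trustworthy. The following is a minimal, hypothetical sketch of such a constrained greedy improvement step for a single state; the function name, the greedy mass-shifting heuristic, and the cost model are illustrative assumptions, not the paper's exact algorithm:

```python
import numpy as np

def soft_constrained_step(q, pi_b, err, eps):
    """Illustrative per-state policy improvement under a Soft-SPIBB-style
    constraint: sum_a err[a] * |pi[a] - pi_b[a]| <= eps.

    q:    estimated Q-values for each action in this state
    pi_b: baseline (behavior) policy probabilities
    err:  uncertainty estimate e(s, a) per action (e.g. from visit counts)
    eps:  total deviation budget
    """
    pi = np.asarray(pi_b, dtype=float).copy()
    budget = eps
    order = np.argsort(q)        # actions from worst to best Q-value
    best = order[-1]             # greedy target action
    for a in order[:-1]:
        if budget <= 0:
            break
        # Moving one unit of mass from a to best changes |pi - pi_b|
        # by err[a] + err[best] under this (assumed) cost model.
        move_cost = err[a] + err[best]
        max_move = min(pi[a], budget / move_cost)
        pi[a] -= max_move
        pi[best] += max_move
        budget -= max_move * move_cost
    return pi
```

With a tight budget the result stays close to π_b; as ε grows (or the uncertainties shrink with more data), the policy approaches the greedy one, which mirrors the abstract's point that the safety bounds only become informative with large amounts of data.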
