Paper Title

Admissible Policy Teaching through Reward Design

Authors

Kiarash Banihashem, Adish Singla, Jiarui Gan, Goran Radanovic

Abstract

We study reward design strategies for incentivizing a reinforcement learning agent to adopt a policy from a set of admissible policies. The goal of the reward designer is to modify the underlying reward function cost-efficiently while ensuring that any approximately optimal deterministic policy under the new reward function is admissible and performs well under the original reward function. This problem can be viewed as a dual to the problem of optimal reward poisoning attacks: instead of forcing an agent to adopt a specific policy, the reward designer incentivizes an agent to avoid taking actions that are inadmissible in certain states. Perhaps surprisingly, and in contrast to the problem of optimal reward poisoning attacks, we first show that the reward design problem for admissible policy teaching is computationally challenging, and it is NP-hard to find an approximately optimal reward modification. We then proceed by formulating a surrogate problem whose optimal solution approximates the optimal solution to the reward design problem in our setting, but is more amenable to optimization techniques and analysis. For this surrogate problem, we present characterization results that provide bounds on the value of the optimal solution. Finally, we design a local search algorithm to solve the surrogate problem and showcase its utility using simulation-based experiments.
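
The constraint stated in the abstract, that every approximately optimal deterministic policy under the modified reward must be admissible, can be illustrated on a toy problem. The sketch below is not the paper's algorithm (the abstract describes a local search over a surrogate problem). It is a brute-force check under assumptions made here for illustration only: a discounted tabular MDP, a policy score defined as expected discounted return from a start distribution `mu`, admissibility given as a set of forbidden (state, action) pairs, and an L2 norm as the modification cost. All names (`policy_score`, `design_is_valid`, `forbidden`, and so on) are hypothetical.

```python
import itertools
import numpy as np

def policy_score(P, R, policy, gamma, mu):
    """Expected discounted return of a deterministic policy (assumed score)."""
    n = R.shape[0]
    idx = np.arange(n)
    pi = np.asarray(policy)
    P_pi = P[idx, pi]                      # (n, n) transitions under the policy
    r_pi = R[idx, pi]                      # (n,) rewards under the policy
    v = np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)
    return float(mu @ v)

def near_optimal_policies(P, R, gamma, mu, eps):
    """Enumerate all deterministic policies within eps of the best score."""
    n, m = R.shape
    scored = [(pi, policy_score(P, R, pi, gamma, mu))
              for pi in itertools.product(range(m), repeat=n)]
    best = max(score for _, score in scored)
    return [pi for pi, score in scored if score >= best - eps]

def is_admissible(policy, forbidden):
    """A policy is admissible if it never picks a forbidden (state, action)."""
    return all((s, a) not in forbidden for s, a in enumerate(policy))

def design_is_valid(P, R_new, gamma, mu, eps, forbidden):
    """True if every eps-optimal deterministic policy under R_new is admissible."""
    return all(is_admissible(pi, forbidden)
               for pi in near_optimal_policies(P, R_new, gamma, mu, eps))

# Toy MDP (hypothetical): action 0 stays in place, action 1 switches state.
P = np.zeros((2, 2, 2))
P[0, 0, 0] = P[1, 0, 1] = 1.0
P[0, 1, 1] = P[1, 1, 0] = 1.0
R = np.array([[0.0, 1.0], [1.0, 0.0]])    # original reward
forbidden = {(0, 1)}                       # switching out of state 0 is inadmissible
mu = np.array([1.0, 0.0])
gamma, eps = 0.9, 0.1

R_new = R.copy()
R_new[0, 1] = -11.0                        # penalize the forbidden action
cost = float(np.linalg.norm(R_new - R))    # assumed cost of the modification

print(design_is_valid(P, R, gamma, mu, eps, forbidden))
print(design_is_valid(P, R_new, gamma, mu, eps, forbidden))
```

Enumeration is exponential in the number of states, so this only serves to make the constraint concrete. It says nothing about how to find a cheap modification, which is the part the abstract reports as NP-hard to approximate.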
