Paper title
The Impact of Task Underspecification in Evaluating Deep Reinforcement Learning
Paper authors
Paper abstract
Evaluations of Deep Reinforcement Learning (DRL) methods are an integral part of scientific progress of the field. Beyond designing DRL methods for general intelligence, designing task-specific methods is becoming increasingly prominent for real-world applications. In these settings, the standard evaluation practice involves using a few instances of Markov Decision Processes (MDPs) to represent the task. However, many tasks induce a large family of MDPs owing to variations in the underlying environment, particularly in real-world contexts. For example, in traffic signal control, variations may stem from intersection geometries and traffic flow levels. The select MDP instances may thus inadvertently cause overfitting, lacking the statistical power to draw conclusions about the method's true performance across the family. In this article, we augment DRL evaluations to consider parameterized families of MDPs. We show that in comparison to evaluating DRL methods on select MDP instances, evaluating the MDP family often yields a substantially different relative ranking of methods, casting doubt on what methods should be considered state-of-the-art. We validate this phenomenon in standard control benchmarks and the real-world application of traffic signal control. At the same time, we show that accurately evaluating on an MDP family is nontrivial. Overall, this work identifies new challenges for empirical rigor in reinforcement learning, especially as the outcomes of DRL trickle into downstream decision-making.
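The contrast between evaluating on a few select MDP instances and evaluating on a parameterized family can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the names `method_a`, `method_b`, and `rank_methods` are hypothetical, the single parameter `flow` stands in for something like a normalized traffic flow level, and the closed-form score functions are toy stand-ins for the returns one would actually measure by training and rolling out a DRL method. It is not the paper's evaluation protocol, only a demonstration of how the relative ranking can flip.

```python
from statistics import mean


def rank_methods(score_fns, mdp_params):
    """Rank methods (best first) by mean score over the given MDP parameters.

    score_fns maps a method name to a function from an MDP parameter to a score.
    """
    averages = {
        name: mean(fn(p) for p in mdp_params) for name, fn in score_fns.items()
    }
    ranking = sorted(averages, key=averages.get, reverse=True)
    return ranking, averages


# Hypothetical stand-ins for measured performance (NOT real results).
def method_a(flow):
    # Strong at low flow, degrades quickly as flow increases.
    return 1.0 - 0.8 * flow


def method_b(flow):
    # Weaker at low flow, but nearly flat across the family.
    return 0.7 - 0.1 * flow


methods = {"method_a": method_a, "method_b": method_b}

# A few hand-picked instances, all at low flow.
select_instances = [0.1, 0.2]
# A grid covering the whole (toy) family, flow in [0, 1].
family = [i / 10 for i in range(11)]

select_ranking, select_avg = rank_methods(methods, select_instances)
family_ranking, family_avg = rank_methods(methods, family)

print("select instances:", select_ranking)
print("whole family:    ", family_ranking)
```

In this toy setup, `method_a` ranks first on the two hand-picked instances while `method_b` ranks first when averaged over the whole grid. A uniform grid over one parameter is itself a simplifying assumption; the abstract notes that accurately evaluating on an MDP family is nontrivial.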