Paper Title
On the Sensitivity of Reward Inference to Misspecified Human Models
Paper Authors
Paper Abstract
Inferring reward functions from human behavior is at the center of value alignment - aligning AI objectives with what we, humans, actually want. But doing so relies on models of how humans behave given their objectives. After decades of research in cognitive science, neuroscience, and behavioral economics, obtaining accurate human models remains an open research topic. This raises the question: how accurate do these models need to be for reward inference to be accurate? On the one hand, if small errors in the model can lead to catastrophic errors in inference, the entire framework of reward learning seems ill-fated, as we will never have perfect models of human behavior. On the other hand, if we can guarantee that reward accuracy improves as our models improve, this would show the benefit of further work on the modeling side. We study this question both theoretically and empirically. We do show that it is, unfortunately, possible to construct small adversarial biases in behavior that lead to arbitrarily large errors in the inferred reward. However, and arguably more importantly, we also identify reasonable assumptions under which the reward inference error can be bounded linearly in the error in the human model. Finally, we verify our theoretical insights in discrete and continuous control tasks with simulated and human data.
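
The two objects the abstract relies on, a human model mapping rewards to behavior and an observer inverting that model, can be made concrete with a toy simulation. The sketch below is not the paper's code; it assumes a Boltzmann-rational human choosing among a few discrete options and an observer running maximum-likelihood reward inference with a possibly misspecified rationality coefficient (beta_assumed is an illustrative parameter, not a quantity from the paper), to show how the inferred reward weights degrade as the assumed human model drifts from the true one.

import numpy as np

rng = np.random.default_rng(0)

# Each option is described by a feature vector; its reward is features @ theta.
features = rng.normal(size=(5, 3))        # 5 options, 3 reward features
theta_true = np.array([1.0, -0.5, 2.0])   # true reward weights (hidden from the observer)

def boltzmann_policy(theta, beta):
    """Choice probabilities of a Boltzmann-rational human with rationality beta."""
    logits = beta * (features @ theta)
    logits -= logits.max()                # numerical stability
    p = np.exp(logits)
    return p / p.sum()

def infer_theta(choices, beta_assumed, lr=0.1, steps=5000):
    """Maximum-likelihood reward inference under an assumed rationality coefficient."""
    theta = np.zeros(3)
    counts = np.bincount(choices, minlength=len(features))
    for _ in range(steps):
        p = boltzmann_policy(theta, beta_assumed)
        # Gradient of the choice log-likelihood with respect to theta.
        grad = beta_assumed * (counts @ features - counts.sum() * (p @ features))
        theta += lr * grad / counts.sum()
    return theta

# Simulate human choices with the true rationality, then infer the reward
# while assuming progressively more misspecified rationality coefficients.
beta_true = 2.0
choices = rng.choice(len(features), size=500, p=boltzmann_policy(theta_true, beta_true))

for beta_assumed in [2.0, 1.0, 0.5]:
    theta_hat = infer_theta(choices, beta_assumed)
    err = np.linalg.norm(theta_hat - theta_true)
    print(f"assumed beta = {beta_assumed:.1f}   ||theta_hat - theta_true|| = {err:.2f}")

In this toy model the rationality coefficient and the reward weights enter the likelihood only through their product, so underestimating the human's rationality inflates the inferred reward weights. That is one simple instance of the kind of human-model misspecification whose effect on reward inference the paper analyzes.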