Paper Title

Efficient Meta Reinforcement Learning for Preference-based Fast Adaptation

Paper Authors

Zhizhou Ren, Anji Liu, Yitao Liang, Jian Peng, Jianzhu Ma

Paper Abstract

Learning new task-specific skills from a few trials is a fundamental challenge for artificial intelligence. Meta reinforcement learning (meta-RL) tackles this problem by learning transferable policies that support few-shot adaptation to unseen tasks. Despite recent advances in meta-RL, most existing methods require access to the environmental reward function of new tasks to infer the task objective, which is not realistic in many practical applications. To bridge this gap, we study the problem of few-shot adaptation in the context of human-in-the-loop reinforcement learning. We develop a meta-RL algorithm that enables fast policy adaptation with preference-based feedback. The agent can adapt to new tasks by querying a human's preference between behavior trajectories instead of using per-step numeric rewards. By extending techniques from information theory, our approach can design query sequences to maximize the information gain from human interactions while tolerating the inherent error of a non-expert human oracle. In experiments, we extensively evaluate our method, Adaptation with Noisy OracLE (ANOLE), on a variety of meta-RL benchmark tasks and demonstrate substantial improvement over baseline algorithms in terms of both feedback efficiency and error tolerance.
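The abstract's key technical ingredient is choosing preference queries that maximize information gain from a noisy (non-expert) oracle. Below is a minimal, hypothetical sketch of that general idea, not the actual ANOLE algorithm: it scores trajectory pairs by the mutual information between a noisy binary preference answer and a discrete posterior over reward hypotheses. The ε-error comparison model, the hypothesis posterior, and all function names are illustrative assumptions rather than details taken from the paper.

```python
import numpy as np

def preference_prob(returns_a, returns_b, error_rate=0.1):
    """P(oracle prefers trajectory a over b) under each reward hypothesis,
    assuming the oracle answers correctly with probability 1 - error_rate."""
    greedy = (returns_a > returns_b).astype(float)
    return greedy * (1.0 - error_rate) + (1.0 - greedy) * error_rate

def binary_entropy(p):
    """Entropy of a Bernoulli(p) answer, elementwise."""
    p = np.clip(p, 1e-12, 1.0 - 1e-12)
    return -(p * np.log(p) + (1.0 - p) * np.log(1.0 - p))

def information_gain(weights, returns_a, returns_b, error_rate=0.1):
    """Mutual information I(answer; hypothesis) between the noisy preference
    answer and the posterior over reward hypotheses."""
    p_a = preference_prob(returns_a, returns_b, error_rate)  # per hypothesis
    marginal = np.dot(weights, p_a)                          # P(answer = "a")
    return binary_entropy(marginal) - np.dot(weights, binary_entropy(p_a))

def select_query(weights, candidate_returns, error_rate=0.1):
    """Pick the pair of candidate trajectories whose answer is most informative.

    candidate_returns: array of shape (num_hypotheses, num_trajectories),
    where entry [h, t] is the return of trajectory t under hypothesis h.
    """
    num_trajectories = candidate_returns.shape[1]
    best_pair, best_gain = None, -np.inf
    for i in range(num_trajectories):
        for j in range(i + 1, num_trajectories):
            gain = information_gain(weights,
                                    candidate_returns[:, i],
                                    candidate_returns[:, j],
                                    error_rate)
            if gain > best_gain:
                best_pair, best_gain = (i, j), gain
    return best_pair, best_gain

# Illustrative usage with random returns: 4 reward hypotheses, 6 trajectories.
rng = np.random.default_rng(0)
returns = rng.normal(size=(4, 6))
posterior = np.ones(4) / 4  # uniform belief over hypotheses
print(select_query(posterior, returns, error_rate=0.1))
```

The ε-error model above is one simple way to account for a fallible oracle; because the mutual-information score already marginalizes over wrong answers, queries whose outcome would be dominated by noise receive low scores and are avoided.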
