Paper Title
Learning Task Automata for Reinforcement Learning using Hidden Markov Models
Paper Authors
Paper Abstract
Training reinforcement learning (RL) agents using scalar reward signals is often infeasible when an environment has sparse and non-Markovian rewards. Moreover, handcrafting these reward functions before training is prone to misspecification, especially when the environment's dynamics are only partially known. This paper proposes a novel pipeline for learning non-Markovian task specifications as succinct finite-state `task automata' from episodes of agent experience within unknown environments. We leverage two key algorithmic insights. First, we learn a product MDP, a model composed of the specification's automaton and the environment's MDP (both initially unknown), by treating the product MDP as a partially observable MDP and using the well-known Baum-Welch algorithm for learning hidden Markov models. Second, we propose a novel method for distilling the task automaton (assumed to be a deterministic finite automaton) from the learnt product MDP. Our learnt task automaton enables the decomposition of a task into its constituent sub-tasks, which improves the rate at which an RL agent can later synthesise an optimal policy. It also provides an interpretable encoding of high-level environmental and task features, so a human can readily verify that the agent has learnt coherent tasks with no misspecifications. In addition, we take steps towards ensuring that the learnt automaton is environment-agnostic, making it well-suited for use in transfer learning. Finally, we provide experimental results compared with two baselines to illustrate our algorithm's performance in different environments and tasks.
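The abstract's first algorithmic step treats the (unknown) product MDP as a hidden Markov model and fits it with Baum-Welch. The sketch below is not the paper's implementation; it is a minimal, self-contained illustration of that kind of fitting step, where hidden states play the role of product-MDP states and observations are high-level event labels from episode traces. All names (`n_states`, `n_obs`, `episodes`, iteration counts) are illustrative assumptions.

```python
# Minimal Baum-Welch sketch for a discrete-observation HMM (illustrative only).
# Hidden states stand in for product-MDP states; observations are event labels.
import numpy as np

def forward_backward(obs, A, B, pi):
    """Scaled forward-backward pass for one observation sequence."""
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N)); beta = np.zeros((T, N)); c = np.zeros(T)
    alpha[0] = pi * B[:, obs[0]]; c[0] = alpha[0].sum(); alpha[0] /= c[0]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
        c[t] = alpha[t].sum(); alpha[t] /= c[t]
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = (A @ (B[:, obs[t + 1]] * beta[t + 1])) / c[t + 1]
    return alpha, beta, c

def baum_welch(episodes, n_states, n_obs, n_iter=50, seed=0):
    """Estimate HMM parameters (A, B, pi) from a list of observation sequences."""
    rng = np.random.default_rng(seed)
    A = rng.dirichlet(np.ones(n_states), size=n_states)   # state transitions
    B = rng.dirichlet(np.ones(n_obs), size=n_states)      # emission probabilities
    pi = np.full(n_states, 1.0 / n_states)                # initial distribution
    for _ in range(n_iter):
        A_num = np.zeros_like(A); B_num = np.zeros_like(B)
        pi_num = np.zeros(n_states); gamma_denom = np.zeros(n_states)
        for obs in episodes:
            alpha, beta, c = forward_backward(obs, A, B, pi)
            gamma = alpha * beta                      # state posteriors per step
            gamma /= gamma.sum(axis=1, keepdims=True)
            pi_num += gamma[0]
            for t in range(len(obs) - 1):             # pairwise posteriors (xi)
                xi = (alpha[t][:, None] * A * B[:, obs[t + 1]] * beta[t + 1]) / c[t + 1]
                A_num += xi
            for t, o in enumerate(obs):
                B_num[:, o] += gamma[t]
            gamma_denom += gamma[:-1].sum(axis=0)
        A = A_num / gamma_denom[:, None]
        B = B_num / B_num.sum(axis=1, keepdims=True)
        pi = pi_num / pi_num.sum()
    return A, B, pi

# Example: three short traces over 4 event labels, modelled with 3 hidden states.
episodes = [np.array([0, 1, 2, 3]), np.array([0, 2, 1, 3]), np.array([0, 1, 1, 3])]
A, B, pi = baum_welch(episodes, n_states=3, n_obs=4)
```

The learnt transition and emission matrices are only the first stage described in the abstract; the subsequent distillation of a deterministic finite automaton from the learnt model is the paper's own contribution and is not reproduced here.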