论文标题
交替使用MDP自动机
Alternating Good-for-MDP Automata
论文作者
论文摘要
当首先在用于控制MDP的模型增强学习(RL)中首次提出了欧米茄规范目标时,使用确定性的Rabin Automata来尝试提供从其过渡到标量值的直接翻译。尽管这些翻译失败了,但事实证明,可以使用商品MDP(GFM)BüchiAutomata来修复它们。这些是非确定性的Büchi自动机,其非确定性类型有限,尽管不像魔法自动机那样受到限制。实际上,确定性的Rabin Automata具有非常简单的转换为这样的GFM自动机,该GFM自动机在状态和成对的数量中是双线性的。有趣的是,对于确定性的Streett Automata来说,也不能说同样的话:非确定性Rabin或BüchiAutomata的翻译以指数成本的成本也是如此,即使不要求目标自动机为商品。我们需要支付的钱更多才能获得MDP Automaton吗?令人惊讶的答案是,当我们将商品MDP属性扩展到交替的自动机时,我们必须支付的费用要少得多:就像从确定性的Rabin自动机中获得的非确定性GFM自动机一样非确定性Büchi自动机。
When omega-regular objectives were first proposed in model-free reinforcement learning (RL) for controlling MDPs, deterministic Rabin automata were used in an attempt to provide a direct translation from their transitions to scalar values. While these translations failed, it has turned out that it is possible to repair them by using good-for-MDPs (GFM) Büchi automata instead. These are nondeterministic Büchi automata with a restricted type of nondeterminism, albeit not as restricted as in good-for-games automata. Indeed, deterministic Rabin automata have a pretty straightforward translation to such GFM automata, which is bi-linear in the number of states and pairs. Interestingly, the same cannot be said for deterministic Streett automata: a translation to nondeterministic Rabin or Büchi automata comes at an exponential cost, even without requiring the target automaton to be good-for-MDPs. Do we have to pay more than that to obtain a good-for-MDP automaton? The surprising answer is that we have to pay significantly less when we instead expand the good-for-MDP property to alternating automata: like the nondeterministic GFM automata obtained from deterministic Rabin automata, the alternating good-for-MDP automata we produce from deterministic Streett automata are bi-linear in the the size of the deterministic automaton and its index, and can therefore be exponentially more succinct than minimal nondeterministic Büchi automata.