Learning-based Bounded Synthesis for Semi-MDPs with LTL Specifications

Paper Title

Learning-based Bounded Synthesis for Semi-MDPs with LTL Specifications

Authors

Ryohei Oura, Toshimitsu Ushio

Abstract

This letter proposes a learning-based bounded synthesis for a semi-Markov decision process (SMDP) with a linear temporal logic (LTL) specification. In the product of the SMDP and the deterministic $K$-co-Büchi automaton (d$K$cBA) converted from the LTL specification, we learn both the winning region for satisfying the LTL specification and the dynamics therein, based on reinforcement learning and Bayesian inference. Then, we synthesize an optimal policy satisfying the following two conditions. (1) It maximizes the probability of reaching the winning region. (2) It minimizes a long-term risk for the dwell time within the winning region. The minimization of the long-term risk is done based on the estimated dynamics and value iteration. We show that, if the discount factor is sufficiently close to one, the synthesized policy converges to the optimal policy as the amount of data obtained through exploration goes to infinity.
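As a rough illustration of condition (1) only, the following minimal sketch estimates transition probabilities from observed counts and then runs value iteration for the maximal probability of reaching a target set. It is not the authors' algorithm: the Dirichlet posterior-mean estimator, the function names `posterior_mean` and `max_reach_probability`, and the toy three-state model are all assumptions made for illustration, and the sketch ignores dwell times, the d$K$cBA product, and the risk criterion of condition (2).

```python
def posterior_mean(counts, states, prior=1.0):
    """Dirichlet posterior mean of P(s' | s, a) from transition counts.

    Assumption: a symmetric Dirichlet prior with concentration `prior`
    per successor state; prior=0.0 gives the empirical frequencies.
    """
    probs = {}
    for (s, a), nxt in counts.items():
        total = sum(nxt.values()) + prior * len(states)
        probs[(s, a)] = {t: (nxt.get(t, 0) + prior) / total for t in states}
    return probs


def max_reach_probability(states, probs, target, iters=1000, tol=1e-10):
    """Value iteration for the maximal probability of reaching `target`."""
    v = {s: 1.0 if s in target else 0.0 for s in states}
    for _ in range(iters):
        delta = 0.0
        for s in states:
            if s in target:
                continue  # target states keep value 1
            # Estimated successor distributions of every action tried in s.
            dists = [p for (s0, _a), p in probs.items() if s0 == s]
            if not dists:
                continue  # unexplored state: value stays at 0
            best = max(sum(p[t] * v[t] for t in states) for p in dists)
            delta = max(delta, abs(best - v[s]))
            v[s] = best
        if delta < tol:
            break
    return v


# Hypothetical toy data: state 2 plays the role of the winning region,
# state 1 is an absorbing sink.
states = [0, 1, 2]
counts = {
    (0, "a"): {2: 3, 1: 1},
    (0, "b"): {1: 4},
    (1, "stay"): {1: 5},
}
values = max_reach_probability(states, posterior_mean(counts, states, prior=0.0), {2})
```

With `prior=0.0` the estimate reduces to empirical frequencies, so action `a` in state 0 reaches the target with estimated probability 3/4. A positive prior assigns nonzero mass to every successor, which changes the computed reachability values when few transitions have been observed.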
