Paper Title

Simulation Based Algorithms for Markov Decision Processes and Multi-Action Restless Bandits

Paper Authors

Rahul Meshram, Kesav Kaza

Paper Abstract

We consider multi-dimensional Markov decision processes and formulate a long-term discounted reward optimization problem. Two simulation-based algorithms, the Monte Carlo rollout policy and the parallel rollout policy, are studied, and various properties of these policies are discussed. We next consider a restless multi-armed bandit (RMAB) with a multi-dimensional state space and a multi-action bandit model. A standard RMAB has two actions for each arm, whereas a multi-action RMAB has more than two actions for each arm. A popular approach for RMABs is the Whittle index based heuristic policy. Indexability is an important requirement for using an index-based policy, and on this basis an RMAB is classified as indexable or non-indexable. Our interest is in the study of the Monte Carlo rollout policy for both indexable and non-indexable restless bandits. We first analyze a standard indexable RMAB (the two-action model) and discuss an index-based policy approach. We present an approximate index computation algorithm using the Monte Carlo rollout policy; its convergence is shown using a two-timescale stochastic approximation scheme. Later, we analyze the multi-action indexable RMAB and discuss the index-based policy approach. We also study non-indexable RMABs, both standard and multi-action, using the Monte Carlo rollout policy.
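
As a rough illustration of the Monte Carlo rollout policy mentioned in the abstract, the sketch below estimates action values by simulating finite-horizon trajectories under a base policy and then acting greedily on those estimates. The `simulate(state, action) -> (next_state, reward)` and `base_policy(state)` interfaces, and all parameter values, are assumptions made for illustration, not the paper's implementation.

```python
def rollout_value(simulate, base_policy, state, action,
                  horizon=20, num_trajectories=50, gamma=0.95):
    """Estimate Q(state, action): average the discounted return over
    trajectories that take `action` first, then follow `base_policy`."""
    total = 0.0
    for _ in range(num_trajectories):
        s, a, discount, ret = state, action, 1.0, 0.0
        for _ in range(horizon):
            s, r = simulate(s, a)       # one simulated transition
            ret += discount * r
            discount *= gamma
            a = base_policy(s)          # base policy drives the rest
        total += ret
    return total / num_trajectories


def rollout_policy(simulate, base_policy, state, actions, **kwargs):
    """One-step lookahead: act greedily on the rollout estimates."""
    return max(actions, key=lambda a: rollout_value(
        simulate, base_policy, state, a, **kwargs))
```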
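The abstract also mentions an approximate index computation algorithm whose convergence follows a two-timescale stochastic approximation scheme. The following is a hedged sketch of that general idea for a single arm: a fast iterate smooths the noisy rollout estimate of the active/passive value gap under a passive-action subsidy, while a slow iterate drives the subsidy toward the indifference point, whose fixed point approximates the Whittle index. The step sizes, the `add_subsidy` wrapper, and the reuse of `rollout_value` from the previous sketch are illustrative assumptions, not the authors' algorithm.

```python
def add_subsidy(simulate, lam):
    """Wrap a single-arm simulator so the passive action (0) earns an
    extra subsidy `lam` per step, as in the Whittle relaxation."""
    def subsidized(state, action):
        next_state, reward = simulate(state, action)
        return next_state, reward + (lam if action == 0 else 0.0)
    return subsidized


def approximate_whittle_index(simulate, base_policy, ref_state,
                              num_iters=500, gamma=0.95):
    """Two-timescale iteration for the index of `ref_state`."""
    lam, gap = 0.0, 0.0
    for n in range(1, num_iters + 1):
        sim = add_subsidy(simulate, lam)
        q_active = rollout_value(sim, base_policy, ref_state, 1, gamma=gamma)
        q_passive = rollout_value(sim, base_policy, ref_state, 0, gamma=gamma)
        # Fast timescale: smooth the Monte Carlo estimate of the value gap.
        gap += (n ** -0.6) * ((q_active - q_passive) - gap)
        # Slow timescale: raise lam while active is preferred, lower it
        # otherwise, so lam settles where the arm is indifferent.
        lam += (1.0 / n) * gap
    return lam
```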
