Paper Title
Policy Learning and Evaluation with Randomized Quasi-Monte Carlo
Paper Authors
Paper Abstract
Reinforcement learning constantly deals with hard integrals, for example when computing expectations in policy evaluation and policy iteration. These integrals are rarely analytically solvable and typically estimated with the Monte Carlo method, which induces high variance in policy values and gradients. In this work, we propose to replace Monte Carlo samples with low-discrepancy point sets. We combine policy gradient methods with Randomized Quasi-Monte Carlo, yielding variance-reduced formulations of policy gradient and actor-critic algorithms. These formulations are effective for policy evaluation and policy improvement, as they outperform state-of-the-art algorithms on standardized continuous control benchmarks. Our empirical analyses validate the intuition that replacing Monte Carlo with Quasi-Monte Carlo yields significantly more accurate gradient estimates.
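The abstract's core idea, replacing i.i.d. Monte Carlo samples with randomized low-discrepancy point sets when estimating expectations, can be illustrated with a minimal sketch. The snippet below is not the paper's algorithm: the integrand `f`, the Gaussian "policy" parameters `mu` and `sigma`, and the use of SciPy's scrambled Sobol sequence are assumptions chosen purely for illustration. It compares the spread of plain Monte Carlo and Randomized Quasi-Monte Carlo estimates of an expectation under a Gaussian distribution.

```python
import numpy as np
from scipy.stats import norm, qmc

# Toy integrand: estimate E[f(a)] under a Gaussian "policy" N(mu, sigma^2).
# (Hypothetical setup; the paper applies the same idea to policy values and gradients.)
mu, sigma = 0.5, 1.0
f = lambda a: np.sin(a) + a ** 2

def mc_estimate(n, rng):
    """Plain Monte Carlo: i.i.d. Gaussian samples."""
    a = rng.normal(mu, sigma, size=n)
    return f(a).mean()

def rqmc_estimate(n, seed):
    """Randomized QMC: scrambled Sobol points mapped through the Gaussian inverse CDF."""
    sobol = qmc.Sobol(d=1, scramble=True, seed=seed)
    u = sobol.random(n).ravel()       # randomized low-discrepancy points in (0, 1)
    a = mu + sigma * norm.ppf(u)      # inverse-CDF transform to N(mu, sigma^2)
    return f(a).mean()

# Compare estimator variability over independent randomizations.
n, reps = 256, 200                    # n a power of 2, as Sobol sequences prefer
rng = np.random.default_rng(0)
mc = [mc_estimate(n, rng) for _ in range(reps)]
rqmc = [rqmc_estimate(n, seed) for seed in range(reps)]
print(f"MC   std: {np.std(mc):.5f}")
print(f"RQMC std: {np.std(rqmc):.5f}")  # typically much smaller for smooth integrands
```

For smooth integrands such as this one, the RQMC estimator's error typically shrinks faster than the O(n^-1/2) Monte Carlo rate, which is the same variance-reduction effect the abstract describes for policy gradient and actor-critic estimates.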