Paper Title

Overcoming the Long Horizon Barrier for Sample-Efficient Reinforcement Learning with Latent Low-Rank Structure

Paper Authors

Tyler Sam, Yudong Chen, Christina Lee Yu

Paper Abstract

The practicality of reinforcement learning algorithms has been limited due to poor scaling with respect to the problem size, as the sample complexity of learning an $\epsilon$-optimal policy is $\tilde{\Omega}\left(|S||A|H^3 / \epsilon^2\right)$ over worst case instances of an MDP with state space $S$, action space $A$, and horizon $H$. We consider a class of MDPs for which the associated optimal $Q^*$ function is low rank, where the latent features are unknown. While one would hope to achieve linear sample complexity in $|S|$ and $|A|$ due to the low rank structure, we show that without imposing further assumptions beyond low rank of $Q^*$, if one is constrained to estimate the $Q$ function using only observations from a subset of entries, there is a worst case instance in which one must incur a sample complexity exponential in the horizon $H$ to learn a near optimal policy. We subsequently show that under stronger low rank structural assumptions, given access to a generative model, Low Rank Monte Carlo Policy Iteration (LR-MCPI) and Low Rank Empirical Value Iteration (LR-EVI) achieve the desired sample complexity of $\tilde{O}\left((|S|+|A|)\mathrm{poly}(d,H)/\epsilon^2\right)$ for a rank-$d$ setting, which is minimax optimal with respect to the scaling of $|S|$, $|A|$, and $\epsilon$. In contrast to the literature on linear and low-rank MDPs, we do not require a known feature mapping, our algorithm is computationally simple, and our results hold for long time horizons. Our results provide insights on the minimal low-rank structural assumptions required on the MDP with respect to the transition kernel versus the optimal action-value function.
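
To make the role of the low-rank assumption concrete, below is a minimal NumPy sketch of the underlying matrix-recovery intuition. It is not the paper's LR-MCPI or LR-EVI algorithm, and all names (such as `n_states` and `anchor_states`) are hypothetical: if $Q^*$ were an exact rank-$d$ matrix over states and actions, it could be reconstructed from only $d$ rows and $d$ columns, roughly $(|S|+|A|)d$ entries, via a CUR-style completion formula. This is the intuition behind hoping for sample complexity linear in $|S|$ and $|A|$ rather than in $|S||A|$.

```python
import numpy as np

# Illustrative sketch only (not the paper's LR-MCPI/LR-EVI algorithms).
# If an |S| x |A| matrix Q* has exact rank d, it is determined by d anchor
# rows and d anchor columns, i.e. on the order of (|S| + |A|) * d entries.

rng = np.random.default_rng(0)
n_states, n_actions, d = 50, 40, 3  # hypothetical sizes

# Build a rank-d "Q*" matrix from unknown latent features.
U = rng.normal(size=(n_states, d))
V = rng.normal(size=(n_actions, d))
Q_star = U @ V.T

# Observe only d anchor rows (states) and d anchor columns (actions).
anchor_states = rng.choice(n_states, size=d, replace=False)
anchor_actions = rng.choice(n_actions, size=d, replace=False)
rows = Q_star[anchor_states, :]                        # d x |A|
cols = Q_star[:, anchor_actions]                       # |S| x d
core = Q_star[np.ix_(anchor_states, anchor_actions)]   # d x d intersection

# CUR-style completion: Q_hat = C * pinv(core) * R, exact when rank(core) = d.
Q_hat = cols @ np.linalg.pinv(core) @ rows

print("max reconstruction error:", np.abs(Q_hat - Q_star).max())
```

In the noiseless, exactly rank-$d$ case above the reconstruction is exact; the paper's contribution concerns what structural assumptions and how many noisy samples per entry are needed for this kind of completion to yield a near optimal policy over a long horizon.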
