Paper Title

Overcoming the Long Horizon Barrier for Sample-Efficient Reinforcement Learning with Latent Low-Rank Structure

Paper Authors

Tyler Sam, Yudong Chen, Christina Lee Yu

Paper Abstract

The practicality of reinforcement learning algorithms has been limited due to poor scaling with respect to the problem size, as the sample complexity of learning an $\epsilon$-optimal policy is $\tilde{\Omega}\left(|S||A|H^3 / \epsilon^2\right)$ over worst case instances of an MDP with state space $S$, action space $A$, and horizon $H$. We consider a class of MDPs for which the associated optimal $Q^*$ function is low rank, where the latent features are unknown. While one would hope to achieve linear sample complexity in $|S|$ and $|A|$ due to the low rank structure, we show that without imposing further assumptions beyond low rank of $Q^*$, if one is constrained to estimate the $Q$ function using only observations from a subset of entries, there is a worst case instance in which one must incur a sample complexity exponential in the horizon $H$ to learn a near optimal policy. We subsequently show that under stronger low rank structural assumptions, given access to a generative model, Low Rank Monte Carlo Policy Iteration (LR-MCPI) and Low Rank Empirical Value Iteration (LR-EVI) achieve the desired sample complexity of $\tilde{O}\left((|S|+|A|)\mathrm{poly}(d,H)/\epsilon^2\right)$ for a rank-$d$ setting, which is minimax optimal with respect to the scaling of $|S|$, $|A|$, and $\epsilon$. In contrast to the literature on linear and low-rank MDPs, we do not require a known feature mapping, our algorithm is computationally simple, and our results hold for long time horizons. Our results provide insights on the minimal low-rank structural assumptions required on the MDP with respect to the transition kernel versus the optimal action-value function.
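
To make the role of the low-rank assumption concrete, below is a minimal NumPy sketch of the underlying matrix-recovery intuition. It is not the paper's LR-MCPI or LR-EVI algorithm, and all names (such as `n_states` and `anchor_states`) are hypothetical: if $Q^*$ were an exact rank-$d$ matrix over states and actions, it could be reconstructed from only $d$ rows and $d$ columns, roughly $(|S|+|A|)d$ entries, via a CUR-style completion formula. This is the intuition behind hoping for sample complexity linear in $|S|$ and $|A|$ rather than in $|S||A|$.

```python
import numpy as np

# Illustrative sketch only (not the paper's LR-MCPI/LR-EVI algorithms).
# If an |S| x |A| matrix Q* has exact rank d, it is determined by d anchor
# rows and d anchor columns, i.e. on the order of (|S| + |A|) * d entries.

rng = np.random.default_rng(0)
n_states, n_actions, d = 50, 40, 3  # hypothetical sizes

# Build a rank-d "Q*" matrix from unknown latent features.
U = rng.normal(size=(n_states, d))
V = rng.normal(size=(n_actions, d))
Q_star = U @ V.T

# Observe only d anchor rows (states) and d anchor columns (actions).
anchor_states = rng.choice(n_states, size=d, replace=False)
anchor_actions = rng.choice(n_actions, size=d, replace=False)
rows = Q_star[anchor_states, :]                        # d x |A|
cols = Q_star[:, anchor_actions]                       # |S| x d
core = Q_star[np.ix_(anchor_states, anchor_actions)]   # d x d intersection

# CUR-style completion: Q_hat = C * pinv(core) * R, exact when rank(core) = d.
Q_hat = cols @ np.linalg.pinv(core) @ rows

print("max reconstruction error:", np.abs(Q_hat - Q_star).max())
```

In the noiseless, exactly rank-$d$ case above the reconstruction is exact; the paper's contribution concerns what structural assumptions and how many noisy samples per entry are needed for this kind of completion to yield a near optimal policy over a long horizon.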
