Paper Title
Approximating Discontinuous Nash Equilibrial Values of Two-Player General-Sum Differential Games
Paper Authors
Paper Abstract
Finding Nash equilibrial policies for two-player differential games requires solving Hamilton-Jacobi-Isaacs (HJI) PDEs. Self-supervised learning has been used to approximate solutions of such PDEs while circumventing the curse of dimensionality. However, this method fails to learn discontinuous PDE solutions due to its sampling nature, leading to poor safety performance of the resulting controllers in robotics applications when player rewards are discontinuous. This paper investigates two potential solutions to this problem: a hybrid method that leverages both supervised Nash equilibria and the HJI PDE, and a value-hardening method in which a sequence of HJIs is solved with a gradually hardening reward. We compare these solutions in terms of generalization and safety performance in two vehicle interaction simulation studies with 5D and 9D state spaces, respectively. Results show that with informative supervision (e.g., collision and near-collision demonstrations) and the low cost of self-supervised learning, the hybrid method achieves better safety performance than the supervised, self-supervised, and value-hardening approaches on an equal computational budget. Value hardening fails to generalize in the higher-dimensional case without informative supervision. Lastly, we show that the neural activation function needs to be continuously differentiable for learning PDEs, and that its choice can be case-dependent.
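As an illustrative sketch only (not the paper's implementation), the hybrid idea described in the abstract can be expressed as a weighted sum of a supervised regression loss on demonstrated Nash-equilibrial values and a self-supervised HJI residual loss, with a continuously differentiable activation (tanh here) so that the PDE residual can be differentiated through the network. The names `ValueNet`, `hji_residual`, `hybrid_loss`, and the `hamiltonian` callable are hypothetical placeholders, not identifiers from the paper.

```python
# Hypothetical sketch of a hybrid supervised + PDE-residual loss for an HJI value network.
import torch
import torch.nn as nn

class ValueNet(nn.Module):
    def __init__(self, state_dim: int, hidden: int = 64):
        super().__init__()
        # tanh is smooth; a non-differentiable activation such as ReLU would
        # break the PDE residual, which needs derivatives of the network output.
        self.net = nn.Sequential(
            nn.Linear(state_dim + 1, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, 1),
        )

    def forward(self, t, x):
        # Input is time concatenated with the joint state of both players.
        return self.net(torch.cat([t, x], dim=-1))

def hji_residual(model, t, x, hamiltonian):
    """PDE residual dV/dt + H(x, dV/dx), evaluated with autograd."""
    t = t.requires_grad_(True)
    x = x.requires_grad_(True)
    v = model(t, x)
    v_t, v_x = torch.autograd.grad(v.sum(), (t, x), create_graph=True)
    return v_t + hamiltonian(x, v_x)

def hybrid_loss(model, sup_batch, pde_batch, hamiltonian, w_pde=1.0):
    """Supervised value regression plus self-supervised HJI residual."""
    t_s, x_s, v_star = sup_batch   # demonstrated equilibrial values (supervision)
    t_p, x_p = pde_batch           # sampled collocation points (self-supervision)
    sup = ((model(t_s, x_s) - v_star) ** 2).mean()
    pde = (hji_residual(model, t_p, x_p, hamiltonian) ** 2).mean()
    return sup + w_pde * pde
```

In this reading, the supervised term anchors the network near discontinuities where sampling alone fails (e.g., collision and near-collision demonstrations), while the residual term propagates the HJI structure cheaply over the rest of the state space.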