Paper Title

Impartial Games: A Challenge for Reinforcement Learning

Paper Authors

Bei Zhou, Søren Riis

Paper Abstract

AlphaZero-style reinforcement learning (RL) algorithms have achieved superhuman performance in many complex board games such as Chess, Shogi, and Go. However, we show that these algorithms encounter significant and fundamental challenges when applied to impartial games, a class where players share game pieces and optimal strategy often relies on abstract mathematical principles. Specifically, we utilize the game of Nim as a concrete and illustrative case study to reveal critical limitations of AlphaZero-style and similar self-play RL algorithms. We introduce a novel conceptual framework distinguishing between champion and expert mastery to evaluate RL agent performance. Our findings reveal that while AlphaZero-style agents can achieve champion-level play on very small Nim boards, their learning progression severely degrades as the board size increases. This difficulty stems not merely from complex data distributions or noisy labels, but from a deeper representational bottleneck: the inherent struggle of generic neural networks to implicitly learn abstract, non-associative functions like parity, which are crucial for optimal play in impartial games. This limitation causes a critical breakdown in the positive feedback loop essential for self-play RL, preventing effective learning beyond rote memorization of frequently observed states. These results align with broader concerns regarding AlphaZero-style algorithms' vulnerability to adversarial attacks, highlighting their inability to truly master all legal game states. Our work underscores that simple hyperparameter adjustments are insufficient to overcome these challenges, establishing a crucial foundation for the development of fundamentally novel algorithmic approaches, potentially involving neuro-symbolic or meta-learning paradigms, to bridge the gap towards true expert-level AI in combinatorial games.
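
For concreteness, the "abstract mathematical principle" behind optimal Nim play is Bouton's theorem: the player to move loses under perfect play exactly when the bitwise XOR of the pile sizes (the Nim-sum) is zero. The minimal Python sketch below (an illustration of that classical result, not code from the paper) computes a winning move directly from this parity-style XOR function, the very kind of function the abstract argues generic networks struggle to learn implicitly.

```python
from functools import reduce
from operator import xor

def nim_sum(piles):
    """Bitwise XOR of all pile sizes; zero iff the position is a loss
    for the player to move (Bouton's theorem)."""
    return reduce(xor, piles, 0)

def optimal_move(piles):
    """Return (pile_index, new_size) restoring a zero Nim-sum,
    or None if every move loses against perfect play."""
    s = nim_sum(piles)
    if s == 0:
        return None  # losing position: no winning move exists
    for i, p in enumerate(piles):
        target = p ^ s   # pile size that would zero the Nim-sum
        if target < p:   # legal only if stones are actually removed
            return (i, target)

# Example: piles (3, 4, 5) have Nim-sum 3 ^ 4 ^ 5 = 2, so the player
# to move wins, e.g. by reducing pile 0 from 3 to 1 (Nim-sum becomes 0).
print(optimal_move([3, 4, 5]))  # -> (0, 1)
```

The entire optimal strategy is a few lines of XOR arithmetic, which makes the reported failure of self-play RL agents to generalize it as board size grows all the more striking.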
