Paper Title
Anti-Overestimation Dialogue Policy Learning for Task-Completion Dialogue System
Paper Authors
Paper Abstract
A dialogue policy module is an essential part of task-completion dialogue systems. Recently, increasing interest has focused on reinforcement learning (RL)-based dialogue policies. Their favorable performance and sound action decisions rely on accurate estimation of action values. The overestimation problem is a widely known issue in RL: the estimate of the maximum action value is larger than the ground truth, which results in an unstable learning process and a suboptimal policy. This problem is detrimental to RL-based dialogue policy learning. To mitigate it, this paper proposes a dynamic partial average estimator (DPAV) of the ground-truth maximum action value. DPAV computes a partial average between the predicted maximum and minimum action values, where the weights are dynamically adaptive and problem-dependent. We incorporate DPAV into a deep Q-network serving as the dialogue policy and show that our method achieves better or comparable results against top baselines on three dialogue datasets from different domains, with a lower computational load. In addition, we theoretically prove convergence and derive upper and lower bounds on the bias, comparing them with those of other methods.
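To make the estimator concrete, below is a minimal PyTorch sketch of a DPAV-style bootstrap target for a DQN update, based only on the description in the abstract. The fixed weight `beta` stands in for the paper's dynamically adaptive, problem-dependent weighting, and the function name and signature are illustrative assumptions rather than the authors' implementation.

```python
import torch

def dpav_target(q_next, rewards, dones, gamma=0.99, beta=0.5):
    """Sketch of a DPAV-style bootstrap target (illustrative, not the authors' code).

    Standard DQN bootstraps from max_a Q(s', a), which is prone to
    overestimation. DPAV instead blends the predicted maximum and minimum
    action values; `beta` here is a fixed placeholder for the paper's
    dynamically adaptive weight.
    """
    q_max = q_next.max(dim=1).values      # overestimation-prone term
    q_min = q_next.min(dim=1).values      # pessimistic counterpart
    partial_avg = beta * q_max + (1.0 - beta) * q_min
    return rewards + gamma * (1.0 - dones) * partial_avg
```

With `beta = 1.0` this reduces to the standard (overestimating) DQN target, and with `beta = 0.0` to a purely pessimistic one; the paper's contribution is choosing this weight adaptively during learning.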