Paper Title
MAD for Robust Reinforcement Learning in Machine Translation
Paper Authors
Paper Abstract
We introduce a new distributed policy gradient algorithm and show that it outperforms existing reward-aware training procedures such as REINFORCE, minimum risk training (MRT) and proximal policy optimization (PPO) in terms of training stability and generalization performance when optimizing machine translation models. Our algorithm, which we call MAD (on account of using the mean absolute deviation in the importance weighting calculation), has distributed data generators sampling multiple candidates per source sentence on worker nodes, while a central learner updates the policy. MAD depends crucially on two variance reduction strategies: (1) a conditional reward normalization method that ensures each source sentence has both positive and negative reward translation examples and (2) a new robust importance weighting scheme that acts as a conditional entropy regularizer. Experiments on a variety of translation tasks show that policies learned using the MAD algorithm perform very well when using both greedy decoding and beam search, and that the learned policies are sensitive to the specific reward used during training.
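The two variance-reduction ideas named in the abstract can be illustrated with a small sketch. The following Python snippet is a hypothetical illustration, not the paper's implementation: it centers the rewards of the candidates sampled for one source sentence so that both positive and negative values appear, and it uses the mean absolute deviation (MAD) of the log-probability ratios to down-weight outlier samples. The function names, the exponential down-weighting form, and the toy numbers are assumptions made for illustration only.

```python
import numpy as np

def conditional_reward_normalization(rewards):
    """Per-source-sentence reward normalization (sketch): centering the
    candidate rewards guarantees both positive and negative values, and
    scaling by the standard deviation keeps their magnitude comparable
    across source sentences."""
    rewards = np.asarray(rewards, dtype=np.float64)
    centered = rewards - rewards.mean()
    return centered / (rewards.std() + 1e-8)

def mad_importance_weights(logp_behavior, logp_current):
    """Robust importance weighting (sketch of the idea, not the paper's
    exact formula): candidates whose log-probability ratio deviates from
    the batch median by many MADs receive exponentially smaller weights."""
    log_ratio = np.asarray(logp_current) - np.asarray(logp_behavior)
    deviation = np.abs(log_ratio - np.median(log_ratio))
    mad = np.mean(deviation) + 1e-8
    return np.exp(-deviation / mad)

# Toy usage: five sampled translations for a single source sentence.
rewards = [0.2, 0.5, 0.1, 0.9, 0.3]             # e.g. sentence-level BLEU
logp_behavior = [-3.0, -2.5, -4.0, -2.0, -3.5]  # log-probs under the sampling policy
logp_current = [-2.8, -2.6, -3.9, -1.8, -3.6]   # log-probs under the current policy

advantages = conditional_reward_normalization(rewards)
weights = mad_importance_weights(logp_behavior, logp_current)
# REINFORCE-style surrogate: each candidate's contribution to the loss.
per_candidate_loss = -weights * advantages * np.asarray(logp_current)
print(per_candidate_loss)
```

In this sketch, the MAD-based weights shrink the influence of samples whose probability has drifted far from the behavior policy, which is one way to read the abstract's description of the weighting scheme acting as a conditional entropy regularizer.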