Paper Title
Offline RL for Natural Language Generation with Implicit Language Q Learning
Paper Authors
Paper Abstract
Large language models distill broad knowledge from text corpora. However, they can be inconsistent when it comes to completing user-specified tasks. This issue can be addressed by finetuning such models via supervised learning on curated datasets, or via reinforcement learning. In this work, we propose a novel offline RL method, implicit language Q-learning (ILQL), designed for use on language models, that combines the flexible utility maximization framework of RL algorithms with the ability of supervised learning to leverage previously collected data, as well as its simplicity and stability. Our method employs a combination of value conservatism alongside an implicit dataset support constraint in learning value functions, which are then used to guide language model generations towards maximizing user-specified utility functions. In addition to empirically validating ILQL, we present a detailed empirical analysis of situations where offline RL can be useful in natural language generation settings, demonstrating how it can be a more effective utility optimizer than prior approaches for end-to-end dialogue, and how it can effectively optimize high-variance reward functions based on subjective judgment, such as whether to label a comment as toxic or not.
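The abstract describes two ingredients: value functions trained with a conservatism term and an implicit dataset-support constraint, which are then used at decoding time to steer the language model towards higher utility. The PyTorch sketch below illustrates this idea under stated assumptions; it is not the authors' released implementation, and the tensor layout, function names (`ilql_style_value_losses`, `value_guided_logits`), and hyperparameters (`tau`, `alpha`, `beta`) are illustrative placeholders.

```python
# A minimal sketch of ILQL-style value learning and value-guided decoding,
# assuming per-token Q and V heads on top of a causal language model.
# Shapes and names are assumptions for illustration only.

import torch
import torch.nn.functional as F


def expectile_loss(diff: torch.Tensor, tau: float = 0.7) -> torch.Tensor:
    # Asymmetric squared loss: regressing V towards an upper expectile of Q
    # keeps the value estimate within the support of tokens seen in the data.
    weight = torch.where(diff > 0, tau, 1.0 - tau)
    return (weight * diff.pow(2)).mean()


def ilql_style_value_losses(q, v, v_next, actions, reward, done,
                            gamma=0.99, alpha=0.1, tau=0.7):
    """Value-learning terms: a Bellman regression for Q, an expectile
    regression for V (the implicit dataset-support constraint), and a
    conservatism penalty that pushes down Q on tokens never taken in the data.

    q:       [batch, vocab]  Q-values for every candidate next token
    v:       [batch]         V(s) at the current position
    v_next:  [batch]         V(s') at the next position
    actions: [batch]         token ids actually observed in the dataset
    reward:  [batch]         per-step reward from the utility function
    done:    [batch]         1.0 at terminal steps, else 0.0
    """
    q_taken = q.gather(-1, actions.unsqueeze(-1)).squeeze(-1)
    target = reward + gamma * (1.0 - done) * v_next.detach()
    q_loss = F.mse_loss(q_taken, target)
    v_loss = expectile_loss(q_taken.detach() - v, tau)
    # Conservatism: penalize large Q-values across the vocabulary relative
    # to the Q-value of the dataset token.
    cql_loss = (torch.logsumexp(q, dim=-1) - q_taken).mean()
    return q_loss + v_loss + alpha * cql_loss


@torch.no_grad()
def value_guided_logits(lm_logits, q, v, beta=1.0):
    # At decoding time, perturb the base LM's next-token logits with the
    # learned advantage beta * (Q - V); sampling then proceeds as usual.
    return lm_logits + beta * (q - v.unsqueeze(-1))
```

In this sketch the base language model stays fixed at inference time; only its next-token logits are shifted by the learned advantage, which is one way to realize "guiding language model generations" with the learned value functions.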