Paper Title
Robust Preference Learning for Storytelling via Contrastive Reinforcement Learning
Paper Authors
Paper Abstract
Controlled automated story generation seeks to generate natural language stories that satisfy constraints expressed as natural language critiques or preferences. Existing methods for controlling story preference rely on prompt engineering, which is labor-intensive and often inconsistent. They may also use logit-manipulation methods, which require annotated datasets for the desired attributes. To address these issues, we first train a contrastive bi-encoder model, named CARP, to align stories with corresponding human critiques, building a general-purpose preference model. This is subsequently used as a reward function to fine-tune a generative language model via reinforcement learning. However, simply fine-tuning a generative language model with a contrastive reward model does not always reliably result in a story generation system capable of generating stories that meet user preferences. To increase story generation robustness, we further fine-tune the contrastive reward model using a prompt-learning technique. A human participant study is then conducted comparing generations from our full system, ablations, and two baselines. We show that the full fine-tuning pipeline results in a story generator preferred over an LLM 20x as large, as well as over logit-based methods. This motivates the use of contrastive learning for general-purpose human preference modeling.
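To make the pipeline concrete, below is a minimal sketch of how a contrastive bi-encoder preference model in the spirit of CARP could score a (story, critique) pair, with the resulting scalar usable as an RL reward. The backbone name (`roberta-base`), mean pooling, and cosine-similarity scoring are illustrative assumptions, not the authors' released implementation or checkpoints.

```python
# Sketch: bi-encoder preference scoring (story vs. critique), assuming a
# RoBERTa-style backbone and cosine similarity as the reward signal.
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer


class BiEncoderReward(torch.nn.Module):
    def __init__(self, model_name: str = "roberta-base"):  # placeholder backbone
        super().__init__()
        self.story_encoder = AutoModel.from_pretrained(model_name)
        self.critique_encoder = AutoModel.from_pretrained(model_name)
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)

    def _embed(self, encoder, texts):
        batch = self.tokenizer(
            texts, padding=True, truncation=True, return_tensors="pt"
        )
        hidden = encoder(**batch).last_hidden_state        # (B, T, H)
        mask = batch["attention_mask"].unsqueeze(-1)       # (B, T, 1)
        pooled = (hidden * mask).sum(1) / mask.sum(1)      # mean pooling
        return F.normalize(pooled, dim=-1)

    @torch.no_grad()
    def reward(self, stories, critiques):
        """Cosine similarity between story and critique embeddings,
        returned as one scalar per pair for use as an RL reward."""
        s = self._embed(self.story_encoder, stories)
        c = self._embed(self.critique_encoder, critiques)
        return (s * c).sum(dim=-1)                         # (B,)
```

Under these assumptions, the scalar returned by `reward()` would be fed to a policy-gradient fine-tuning loop (e.g., PPO) as the per-sample reward for the generated story; the prompt-learning step described in the abstract would further fine-tune this reward model before it is used in that loop.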