Paper Title

Constitutional AI: Harmlessness from AI Feedback

Paper Authors

Bai, Yuntao, Kadavath, Saurav, Kundu, Sandipan, Askell, Amanda, Kernion, Jackson, Jones, Andy, Chen, Anna, Goldie, Anna, Mirhoseini, Azalia, McKinnon, Cameron, Chen, Carol, Olsson, Catherine, Olah, Christopher, Hernandez, Danny, Drain, Dawn, Ganguli, Deep, Li, Dustin, Tran-Johnson, Eli, Perez, Ethan, Kerr, Jamie, Mueller, Jared, Ladish, Jeffrey, Landau, Joshua, Ndousse, Kamal, Lukosuite, Kamile, Lovitt, Liane, Sellitto, Michael, Elhage, Nelson, Schiefer, Nicholas, Mercado, Noemi, DasSarma, Nova, Lasenby, Robert, Larson, Robin, Ringer, Sam, Johnston, Scott, Kravec, Shauna, Showk, Sheer El, Fort, Stanislav, Lanham, Tamera, Telleen-Lawton, Timothy, Conerly, Tom, Henighan, Tom, Hume, Tristan, Bowman, Samuel R., Hatfield-Dodds, Zac, Mann, Ben, Amodei, Dario, Joseph, Nicholas, McCandlish, Sam, Brown, Tom, Kaplan, Jared

Paper Abstract

As AI systems become more capable, we would like to enlist their help to supervise other AIs. We experiment with methods for training a harmless AI assistant through self-improvement, without any human labels identifying harmful outputs. The only human oversight is provided through a list of rules or principles, and so we refer to the method as 'Constitutional AI'. The process involves both a supervised learning and a reinforcement learning phase. In the supervised phase we sample from an initial model, then generate self-critiques and revisions, and then finetune the original model on revised responses. In the RL phase, we sample from the finetuned model, use a model to evaluate which of the two samples is better, and then train a preference model from this dataset of AI preferences. We then train with RL using the preference model as the reward signal, i.e. we use 'RL from AI Feedback' (RLAIF). As a result we are able to train a harmless but non-evasive AI assistant that engages with harmful queries by explaining its objections to them. Both the SL and RL methods can leverage chain-of-thought style reasoning to improve the human-judged performance and transparency of AI decision making. These methods make it possible to control AI behavior more precisely and with far fewer human labels.
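The abstract describes a two-phase pipeline: a supervised critique-and-revision phase, followed by RL against a preference model trained on AI-generated comparisons (RLAIF). The sketch below is a minimal, hypothetical Python outline of that control flow only. The callables `generate`, `finetune`, `judge`, and `train_preference_model` are assumed placeholders for a language-model API and are not part of the paper; prompt templates, the number of revision rounds, and the RL step itself are simplified or omitted.

```python
# Minimal, hypothetical sketch of the two-phase Constitutional AI control flow
# described in the abstract. `generate`, `finetune`, `judge`, and
# `train_preference_model` are assumed placeholder callables, not the paper's API.
import random
from typing import Callable, List, Tuple

def sl_cai_phase(
    generate: Callable[[str], str],
    finetune: Callable[[List[Tuple[str, str]]], Callable[[str], str]],
    prompts: List[str],
    principles: List[str],
    revision_rounds: int = 1,
) -> Callable[[str], str]:
    """Supervised phase: sample, self-critique, revise, then finetune on revisions."""
    revised_pairs = []
    for prompt in prompts:
        response = generate(prompt)
        for _ in range(revision_rounds):
            principle = random.choice(principles)  # draw a constitutional principle
            critique = generate(
                f"{prompt}\n{response}\nCritique this response according to: {principle}"
            )
            response = generate(
                f"{prompt}\n{response}\nCritique: {critique}\n"
                "Rewrite the response to address the critique."
            )
        revised_pairs.append((prompt, response))
    # Finetune the original model on the revised responses.
    return finetune(revised_pairs)

def rl_cai_phase(
    sl_model: Callable[[str], str],
    judge: Callable[[str, str, str, str], int],  # returns 0 or 1: which sample is better
    train_preference_model: Callable[[List[Tuple[str, str, str, int]]], object],
    prompts: List[str],
    principles: List[str],
) -> object:
    """RL phase: sample response pairs, label them with an AI judge, train a preference model."""
    ai_preferences = []
    for prompt in prompts:
        a, b = sl_model(prompt), sl_model(prompt)
        principle = random.choice(principles)
        preferred = judge(prompt, a, b, principle)  # the AI, not a human, picks the better sample
        ai_preferences.append((prompt, a, b, preferred))
    preference_model = train_preference_model(ai_preferences)
    # The preference model then provides the reward signal for RL training (RLAIF);
    # the RL optimization step itself is outside the scope of this sketch.
    return preference_model
```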
