Paper Title

Can large language models reason about medical questions?

Authors

Valentin Liévin, Christoffer Egeberg Hother, Andreas Geert Motzfeldt, Ole Winther

Abstract

Although large language models (LLMs) often produce impressive outputs, it remains unclear how they perform in real-world scenarios requiring strong reasoning skills and expert domain knowledge. We set out to investigate whether closed- and open-source models (GPT-3.5, Llama-2, etc.) can be applied to answer and reason about difficult real-world-based questions. We focus on three popular medical benchmarks (MedQA-USMLE, MedMCQA, and PubMedQA) and multiple prompting scenarios: Chain-of-Thought (CoT, "think step by step"), few-shot, and retrieval augmentation. Based on an expert annotation of the generated CoTs, we found that InstructGPT can often read, reason, and recall expert knowledge. Lastly, by leveraging advances in prompt engineering (few-shot and ensemble methods), we demonstrated that GPT-3.5 not only yields calibrated predictive distributions but also reaches the passing score on all three datasets: MedQA-USMLE 60.2%, MedMCQA 62.7%, and PubMedQA 78.2%. Open-source models are closing the gap: Llama-2 70B also passed the MedQA-USMLE with 62.5% accuracy.
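
For readers unfamiliar with the prompting scenarios named in the abstract, the following is a minimal sketch of zero-shot Chain-of-Thought prompting combined with self-consistency-style ensembling (sample several CoTs, majority-vote the answers) on a MedQA-USMLE-style multiple-choice question. The `call_llm` helper, the sample question, and the answer-extraction heuristic are illustrative assumptions, not code or data from the paper.

```python
import re
from collections import Counter

def call_llm(prompt: str, temperature: float = 0.7) -> str:
    """Hypothetical stand-in for any completion API (GPT-3.5, Llama-2, ...)."""
    raise NotImplementedError

def cot_prompt(question: str, options: dict[str, str]) -> str:
    """Zero-shot CoT prompt: question, lettered options, then the
    'think step by step' cue that elicits a reasoning trace before the answer."""
    lettered = "\n".join(f"{letter}) {text}" for letter, text in options.items())
    return f"Question: {question}\n{lettered}\nAnswer: Let's think step by step."

def extract_choice(completion: str) -> str | None:
    """Heuristic: take the last standalone option letter in the generated CoT."""
    matches = re.findall(r"\b([A-D])\b", completion)
    return matches[-1] if matches else None

def self_consistency(question: str, options: dict[str, str], n_samples: int = 5) -> str:
    """Ensemble: sample several CoTs at nonzero temperature and majority-vote."""
    prompt = cot_prompt(question, options)
    votes = [extract_choice(call_llm(prompt)) for _ in range(n_samples)]
    counts = Counter(vote for vote in votes if vote is not None)
    return counts.most_common(1)[0][0]  # most frequent extracted answer

# Hypothetical usage with an invented question (not from the benchmarks):
question = "A 55-year-old man presents with crushing substernal chest pain. First-line drug?"
options = {"A": "Aspirin", "B": "Amoxicillin", "C": "Insulin", "D": "Warfarin"}
# answer = self_consistency(question, options)  # e.g. "A"
```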
