Paper Title
STORIUM: A Dataset and Evaluation Platform for Machine-in-the-Loop Story Generation
Paper Authors
Paper Abstract
Systems for story generation are asked to produce plausible and enjoyable stories given an input context. This task is underspecified, as a vast number of diverse stories can originate from a single input. The large output space makes it difficult to build and evaluate story generation models, as (1) existing datasets lack rich enough contexts to meaningfully guide models, and (2) existing evaluations (both crowdsourced and automatic) are unreliable for assessing long-form creative text. To address these issues, we introduce a dataset and evaluation platform built from STORIUM, an online collaborative storytelling community. Our author-generated dataset contains 6K lengthy stories (125M tokens) with fine-grained natural language annotations (e.g., character goals and attributes) interspersed throughout each narrative, forming a robust source for guiding models. We evaluate language models fine-tuned on our dataset by integrating them onto STORIUM, where real authors can query a model for suggested story continuations and then edit them. Automatic metrics computed over these edits correlate well with both user ratings of generated stories and qualitative feedback from semi-structured user interviews. We release both the STORIUM dataset and evaluation platform to spur more principled research into story generation.
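The abstract does not specify how the edit-based automatic metrics are computed. As a rough illustration only (not the paper's actual metric), one plausible metric of this kind measures how much of a model's suggested continuation survives the author's edit. The sketch below is a minimal assumption-laden example: the function name `edit_preservation_ratio` and the token-level matching via Python's standard `difflib` are illustrative choices, not taken from the paper.

```python
import difflib

def edit_preservation_ratio(generated: str, edited: str) -> float:
    """Fraction of the model's generated tokens that survive the author's edit.

    Hypothetical metric for illustration: a value near 1.0 means the author
    kept the suggestion mostly intact; near 0.0 means it was largely rewritten.
    """
    gen_tokens = generated.split()
    edit_tokens = edited.split()
    if not gen_tokens:
        return 0.0
    # Match token sequences between the suggestion and the final edited text.
    matcher = difflib.SequenceMatcher(a=gen_tokens, b=edit_tokens, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(gen_tokens)

# Example: an author lightly edits a suggested story continuation.
suggestion = "Adira crept toward the vault, clutching the stolen key"
final_text = "Adira crept toward the vault, hiding the stolen key in her cloak"
print(f"preserved: {edit_preservation_ratio(suggestion, final_text):.2f}")
```

A metric like this can be averaged over many suggestion/edit pairs per model, which is one way such per-edit scores could be aggregated before correlating them with user ratings, as the abstract describes.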