Title
Learning to Reuse Distractors to support Multiple Choice Question Generation in Education
Authors
Abstract
Multiple choice questions (MCQs) are widely used in digital learning systems, as they allow for automating the assessment process. However, due to the increased digital literacy of students and the advent of social media platforms, MCQ tests are widely shared online, and teachers are continuously challenged to create new questions, which is an expensive and time-consuming task. A particularly sensitive aspect of MCQ creation is devising relevant distractors, i.e., wrong answers that are not easily identifiable as wrong. This paper studies how a large existing set of manually created answers and distractors, for questions spanning a variety of domains, subjects, and languages, can be leveraged to help teachers create new MCQs through the smart reuse of existing distractors. We build several data-driven models based on context-aware question and distractor representations, and compare them with static feature-based models. The proposed models are evaluated with automated metrics and in a realistic user test with teachers. Both automatic and human evaluations indicate that context-aware models consistently outperform the static feature-based approach. For our best-performing context-aware model, on average 3 of the 10 distractors shown to teachers were rated as high quality. We create a performance benchmark and make it public, to enable comparison between different approaches and to introduce a more standardized evaluation of the task. The benchmark contains a test set of 298 educational questions covering multiple subjects and languages, and a 77k-entry multilingual distractor vocabulary for future research.
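
The core idea, ranking a pool of previously authored distractors against a new question using context-aware representations, can be sketched as follows. This is a minimal illustration, not the paper's actual model: the encoder choice (the off-the-shelf multilingual `paraphrase-multilingual-MiniLM-L12-v2` from `sentence-transformers`), the cosine-similarity scoring, and the `rank_distractors` helper are all assumptions made for the sake of the example.

```python
# Hypothetical sketch of distractor reuse by similarity ranking.
# The encoder, query construction, and scoring are illustrative
# assumptions, not the models evaluated in the paper.
import numpy as np
from sentence_transformers import SentenceTransformer

# Assumed multilingual sentence encoder; any comparable encoder works.
encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

def rank_distractors(question: str, answer: str,
                     pool: list[str], top_k: int = 10) -> list[str]:
    """Return the top_k pool distractors most similar to the question context."""
    # Context-aware query: the question together with its correct answer.
    query_vec = encoder.encode([f"{question} {answer}"])[0]
    pool_vecs = encoder.encode(pool)
    # Cosine similarity between the query and every candidate distractor.
    sims = pool_vecs @ query_vec / (
        np.linalg.norm(pool_vecs, axis=1) * np.linalg.norm(query_vec)
    )
    order = np.argsort(-sims)  # indices sorted by descending similarity
    return [pool[i] for i in order[:top_k]]

# Usage: suggest candidate distractors for a new MCQ from a reuse pool.
candidates = rank_distractors(
    "Which planet is known as the Red Planet?", "Mars",
    ["Venus", "Jupiter", "Saturn", "Mercury", "Neptune"], top_k=3,
)
print(candidates)
```

In this setup the teacher would review the returned candidates and keep only those rated as plausible wrong answers, mirroring the user test described in the abstract, where on average 3 of 10 suggestions were judged high quality.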