Paper Title
Diversity Over Size: On the Effect of Sample and Topic Sizes for Topic-Dependent Argument Mining Datasets
Paper Authors
Paper Abstract
Argument Mining, that is, extracting and classifying argument components for a specific topic from large document sources, is inherently difficult for machine learning models and humans alike: large Argument Mining datasets are rare, and recognizing argument components requires expert knowledge. The task becomes even more difficult if it also involves stance detection of the retrieved arguments. In this work, we investigate the effect of Argument Mining dataset composition in few- and zero-shot settings. Our findings show that, while fine-tuning is mandatory to achieve acceptable model performance, using carefully composed training samples and reducing the training sample size by up to almost 90% can still yield 95% of the maximum performance. This gain is consistent across three Argument Mining tasks on three different datasets. We also publish a new dataset for future benchmarking.