LARD: Large-scale Artificial Disfluency Generation

Paper title

LARD: Large-scale Artificial Disfluency Generation

Authors

Passali, T., Mavropoulos, T., Tsoumakas, G., Meditskos, G., Vrochidis, S.

Abstract

Disfluency detection is a critical task in real-time dialogue systems. Despite its importance, it remains a relatively unexplored field, mainly because appropriate datasets are lacking. Existing datasets also suffer from various issues, including class imbalance, which can significantly degrade model performance on rare classes, as demonstrated in this paper. To this end, we propose LARD, a method for generating complex and realistic artificial disfluencies with little effort. The proposed method can handle three of the most common types of disfluencies: repetitions, replacements and restarts. In addition, we release a new large-scale dataset with disfluencies that can be used for four different tasks: disfluency detection, classification, extraction and correction. Experimental results on the LARD dataset demonstrate that the data produced by the proposed method can be effectively used for detecting and removing disfluencies, while also addressing the limitations of existing datasets.
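To make the three disfluency types named in the abstract concrete, the following is a minimal illustrative sketch of how each could be injected into a fluent token sequence. It is not the LARD algorithm: the function names, the fixed repair cue "I mean", and the position arguments are all assumptions chosen for illustration, and the abstract does not describe how the actual method selects spans or words.

```python
# Illustrative sketch only. All helper names and the repair cue are
# hypothetical; they are not taken from the LARD paper or its code.

def add_repetition(tokens, start, length):
    """Repeat the span tokens[start:start+length] right after itself."""
    span = tokens[start:start + length]
    return tokens[:start + length] + span + tokens[start + length:]


def add_replacement(tokens, index, wrong_word, cue="I mean"):
    """Insert a wrong word and a repair cue before the intended word at index."""
    return tokens[:index] + [wrong_word] + cue.split() + tokens[index:]


def add_restart(abandoned_tokens, cut, fluent_tokens):
    """Prefix the fluent sentence with the first `cut` tokens of an abandoned one."""
    return abandoned_tokens[:cut] + fluent_tokens


fluent = "book a flight to Boston".split()

repetition = " ".join(add_repetition(fluent, 1, 2))
replacement = " ".join(add_replacement(fluent, 4, "Denver"))
restart = " ".join(add_restart("what time does the".split(), 3, fluent))

print(repetition)   # book a flight a flight to Boston
print(replacement)  # book a flight to Denver I mean Boston
print(restart)      # what time does book a flight to Boston
```

Because each helper knows exactly which tokens it inserted, a generator built along these lines could also emit token-level labels, which is one plausible way to obtain supervision for detection, classification, extraction and correction from a single synthetic example.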
