Paper Title

DAGAM: Data Augmentation with Generation And Modification

Paper Authors

Byeong-Cheol Jo, Tak-Sung Heo, Yeongjoon Park, Yongmin Yoo, Won Ik Cho, Kyungsun Kim

Paper Abstract

Text classification is a representative downstream task of natural language processing, and has exhibited excellent performance since the advent of pre-trained language models based on the Transformer architecture. However, pre-trained language models often underfit because the model is very large relative to the amount of available training data. Given the importance of data collection in the modern machine learning paradigm, research on natural language data augmentation has been actively conducted. In light of this, we introduce three data augmentation schemes that help reduce the underfitting problem of large-scale language models. First, we use a generation model for data augmentation, defined as Data Augmentation with Generation (DAG). Next, we augment data using text modification techniques such as corruption and word order change (Data Augmentation with Modification, DAM). Finally, we propose Data Augmentation with Generation And Modification (DAGAM), which combines the DAG and DAM techniques for boosted performance. We perform data augmentation on six benchmark datasets for the text classification task, and verify the usefulness of DAG, DAM, and DAGAM through BERT-based fine-tuning and evaluation, obtaining better results than with the original datasets.
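The abstract names the DAM operations only at a high level (corruption and word order change). The sketch below is a minimal Python illustration of how such modifications could be implemented; the specific operations (random character deletion, random word swaps), function names, and parameters are assumptions for illustration, not the paper's exact procedure.

```python
import random

def corrupt_characters(text, ratio=0.1, seed=None):
    """Randomly delete a fraction of characters: one simple form of corruption (assumed)."""
    rng = random.Random(seed)
    chars = list(text)
    if not chars:
        return text
    n_drop = max(1, int(len(chars) * ratio))
    # Delete from the back so earlier indices stay valid.
    for idx in sorted(rng.sample(range(len(chars)), n_drop), reverse=True):
        del chars[idx]
    return "".join(chars)

def change_word_order(text, n_swaps=1, seed=None):
    """Randomly swap word pairs: one simple form of word order change (assumed)."""
    rng = random.Random(seed)
    words = text.split()
    if len(words) < 2:
        return text
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return " ".join(words)

# Example: augment one labeled sentence into two modified copies, DAM-style.
sentence = "pre-trained language models often underfit when training data is scarce"
print(corrupt_characters(sentence, ratio=0.1, seed=0))
print(change_word_order(sentence, n_swaps=1, seed=0))
```

DAG would analogously pass training sentences through a pretrained generation model (e.g., a seq2seq summarizer) to synthesize new same-label examples, and DAGAM would pool the outputs of both schemes with the original data; the abstract does not specify the exact generation model, so any particular model choice here would be an assumption.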
