Paper Title

AraELECTRA: Pre-Training Text Discriminators for Arabic Language Understanding

Paper Authors

Wissam Antoun, Fady Baly, Hazem Hajj

Paper Abstract

Advances in English language representation have enabled a more sample-efficient pre-training task: Efficiently Learning an Encoder that Classifies Token Replacements Accurately (ELECTRA). Instead of training a model to recover masked tokens, ELECTRA trains a discriminator model to distinguish true input tokens from corrupted tokens that were replaced by a generator network. In contrast, current Arabic language representation approaches rely solely on pretraining via masked language modeling. In this paper, we develop an Arabic language representation model, which we name AraELECTRA. Our model is pretrained with the replaced-token-detection objective on large Arabic text corpora. We evaluate the model on multiple Arabic NLP tasks, including reading comprehension, sentiment analysis, and named-entity recognition, and we show that AraELECTRA outperforms current state-of-the-art Arabic language representation models, given the same pretraining data and even with a smaller model size.
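
To make the replaced-token-detection objective concrete, here is a minimal PyTorch sketch of one training step. It assumes `generator` and `discriminator` are callables returning per-position logits; the 15% masking rate and the discriminator loss weight of 50 follow the original ELECTRA paper, and all names are illustrative rather than the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def rtd_step(generator, discriminator, input_ids, mask_token_id, mask_prob=0.15):
    """One replaced-token-detection (RTD) training step; a sketch, not
    AraELECTRA's exact code. Shapes: input_ids is (batch, seq_len)."""
    # 1) Mask a random ~15% of positions (ELECTRA's default rate).
    mask = torch.rand(input_ids.shape) < mask_prob
    masked_ids = input_ids.masked_fill(mask, mask_token_id)

    # 2) The generator is trained with the usual masked-language-modeling loss.
    gen_logits = generator(masked_ids)                    # (batch, seq, vocab)
    gen_loss = F.cross_entropy(gen_logits[mask], input_ids[mask])

    # 3) Sample replacement tokens from the generator's output distribution
    #    and splice them into the original sequence at the masked positions.
    #    Sampling is non-differentiable, so the discriminator loss does not
    #    backpropagate into the generator, matching ELECTRA.
    sampled = torch.distributions.Categorical(logits=gen_logits).sample()
    corrupted = torch.where(mask, sampled, input_ids)

    # 4) The discriminator predicts, per token, original vs. replaced. A
    #    sampled token that happens to equal the original counts as "original".
    labels = (corrupted != input_ids).float()
    disc_logits = discriminator(corrupted)                # (batch, seq)
    disc_loss = F.binary_cross_entropy_with_logits(disc_logits, labels)

    # 5) ELECTRA up-weights the discriminator loss (lambda = 50 in the paper).
    return gen_loss + 50.0 * disc_loss
```

Because the discriminator receives a binary label at every input position, rather than a prediction target at only the ~15% of masked positions, it learns from all tokens in the sequence; this is the source of the sample efficiency claimed in the abstract.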
