Paper Title
PTT5: Pretraining and validating the T5 model on Brazilian Portuguese data
Paper Authors
Paper Abstract
In natural language processing (NLP), there is a need for more resources in Portuguese, since much of the data used in state-of-the-art research is in other languages. In this paper, we pretrain a T5 model on the BrWac corpus, an extensive collection of web pages in Portuguese, and evaluate its performance against other Portuguese pretrained models and multilingual models on three different tasks. We show that our Portuguese pretrained models perform significantly better than the original T5 models. Moreover, we demonstrate the positive impact of using a Portuguese vocabulary. Our code and models are available at https://github.com/unicamp-dl/PTT5.