Paper Title
MANTa: Efficient Gradient-Based Tokenization for Robust End-to-End Language Modeling
Paper Authors
Paper Abstract
Static subword tokenization algorithms have been an essential component of recent works on language modeling. However, their static nature results in important flaws that degrade the models' downstream performance and robustness. In this work, we propose MANTa, a Module for Adaptive Neural TokenizAtion. MANTa is a differentiable tokenizer trained end-to-end with the language model. The resulting system offers a trade-off between the expressiveness of byte-level models and the speed of models trained using subword tokenization. In addition, our tokenizer is highly explainable since it produces an explicit segmentation of sequences into blocks. We evaluate our pre-trained model on several English datasets from different domains as well as on synthetic noise. We find that MANTa improves robustness to character perturbations and out-of-domain data. We then show that MANTa performs comparably to other models on the general-domain GLUE benchmark. Finally, we show that it is considerably faster than strictly byte-level models.
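To illustrate the kind of mechanism the abstract describes, below is a minimal, self-contained sketch of a differentiable byte-to-block pooling module. It is not the authors' implementation: the module name `SoftBlockPooler`, the dimensions, and the Gaussian pooling kernel over cumulative boundary probabilities are illustrative assumptions meant only to show how a segmentation of byte sequences into blocks can remain differentiable and trainable end-to-end with a language model.

```python
# Illustrative sketch only (not the MANTa implementation): a differentiable
# byte-to-block pooling module in PyTorch. Names, dimensions, and the
# Gaussian pooling kernel are assumptions for demonstration purposes.
import torch
import torch.nn as nn


class SoftBlockPooler(nn.Module):
    """Pools byte embeddings into a fixed number of soft 'blocks'.

    A per-byte boundary probability is predicted; its cumulative sum gives
    each byte a fractional block position, and bytes are pooled into block
    slots with Gaussian weights around each slot index, keeping the whole
    segmentation differentiable.
    """

    def __init__(self, d_model: int, num_blocks: int, sigma: float = 1.0):
        super().__init__()
        self.boundary_head = nn.Linear(d_model, 1)
        self.num_blocks = num_blocks
        self.sigma = sigma

    def forward(self, byte_embeds: torch.Tensor) -> torch.Tensor:
        # byte_embeds: (batch, seq_len, d_model)
        p_boundary = torch.sigmoid(self.boundary_head(byte_embeds)).squeeze(-1)
        # Fractional block position of every byte: (batch, seq_len)
        positions = torch.cumsum(p_boundary, dim=1)
        # Block slot centres 0..num_blocks-1: (num_blocks,)
        centres = torch.arange(self.num_blocks, device=byte_embeds.device).float()
        # Gaussian weight of each block slot over the bytes: (batch, num_blocks, seq_len)
        dist = positions.unsqueeze(1) - centres.view(1, -1, 1)
        weights = torch.exp(-0.5 * (dist / self.sigma) ** 2)
        weights = weights / (weights.sum(dim=-1, keepdim=True) + 1e-9)
        # Block embeddings fed to the downstream language model: (batch, num_blocks, d_model)
        return torch.bmm(weights, byte_embeds)


if __name__ == "__main__":
    pooler = SoftBlockPooler(d_model=64, num_blocks=16)
    bytes_in = torch.randn(2, 128, 64)   # stand-in for embedded UTF-8 bytes
    blocks = pooler(bytes_in)
    print(blocks.shape)                  # torch.Size([2, 16, 64])
```

Because every byte contributes to block embeddings with differentiable weights, gradients from the language modeling loss can flow back into the boundary predictor, which is what allows such a tokenizer to be learned jointly with the model while still exposing an explicit, inspectable segmentation.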