Paper Title

This is the way: designing and compiling LEPISZCZE, a comprehensive NLP benchmark for Polish

Paper Authors

Łukasz Augustyniak, Kamil Tagowski, Albert Sawczyn, Denis Janiak, Roman Bartusiak, Adrian Szymczak, Marcin Wątroba, Arkadiusz Janz, Piotr Szymański, Mikołaj Morzy, Tomasz Kajdanowicz, Maciej Piasecki

Paper Abstract

The availability of compute and data to train larger and larger language models increases the demand for robust methods of benchmarking the true progress of LM training. Recent years have witnessed significant progress in standardized benchmarking for English. Benchmarks such as GLUE, SuperGLUE, or KILT have become de facto standard tools to compare large language models. Following the trend to replicate GLUE for other languages, the KLEJ benchmark has been released for Polish. In this paper, we evaluate the progress in benchmarking for low-resourced languages. We note that only a handful of languages have such comprehensive benchmarks. We also note the gap in the number of tasks evaluated by benchmarks for resource-rich English/Chinese versus the rest of the world. In this paper, we introduce LEPISZCZE (the Polish word for glew, the Middle English predecessor of glue), a new, comprehensive benchmark for Polish NLP with a large variety of tasks and high-quality operationalization of the benchmark. We design LEPISZCZE with flexibility in mind: including new models, datasets, and tasks is as simple as possible while still offering data versioning and model tracking. In the first run of the benchmark, we run 13 experiments (task and dataset pairs) based on the five most recent LMs for Polish. We use five datasets from the existing Polish benchmark and add eight novel datasets. As the paper's main contribution, apart from LEPISZCZE itself, we provide the insights and experiences gained while creating the benchmark for Polish, as a blueprint for designing similar benchmarks for other low-resourced languages.
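
The abstract stresses that LEPISZCZE was designed for flexibility: adding a new model, dataset, or task should be as simple as possible while preserving data versioning and model tracking. The following minimal Python sketch only illustrates that design idea; the names used here (BenchmarkTask, BenchmarkRegistry, accuracy, the dataset identifier) are hypothetical and are not the actual LEPISZCZE API.

from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class BenchmarkTask:
    # One experiment: a (task, dataset) pair plus its evaluation metric.
    name: str
    dataset_id: str  # illustrative dataset identifier only
    metric: Callable[[List[str], List[str]], float]

@dataclass
class BenchmarkRegistry:
    # Adding a new task is a single register() call, in the spirit of
    # the "as simple as possible" goal described in the abstract.
    tasks: Dict[str, BenchmarkTask] = field(default_factory=dict)

    def register(self, task: BenchmarkTask) -> None:
        self.tasks[task.name] = task

    def run(self,
            predict: Callable[[str], List[str]],
            gold: Dict[str, List[str]]) -> Dict[str, float]:
        # predict maps a dataset id to model predictions;
        # gold maps a dataset id to reference labels.
        return {
            name: task.metric(predict(task.dataset_id), gold[task.dataset_id])
            for name, task in self.tasks.items()
        }

def accuracy(preds: List[str], refs: List[str]) -> float:
    return sum(p == r for p, r in zip(preds, refs)) / len(refs)

# Toy usage with a dummy model; the dataset name is made up.
registry = BenchmarkRegistry()
registry.register(BenchmarkTask("sentiment-pl", "toy/polish-sentiment", accuracy))
gold = {"toy/polish-sentiment": ["pos", "neg", "pos"]}

def dummy_predict(dataset_id: str) -> List[str]:
    return ["pos", "neg", "neg"]

print(registry.run(dummy_predict, gold))  # {'sentiment-pl': 0.666...}

A registry along these lines keeps a benchmark open-ended: new (task, dataset) pairs, such as the 13 experiments mentioned in the abstract, can be plugged in without touching the evaluation loop.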
