Paper title
TAPE: Assessing Few-shot Russian Language Understanding
Authors
Abstract
Recent advances in zero-shot and few-shot learning have shown promise for a range of research and practical purposes. However, this fast-growing area lacks standardized evaluation suites for non-English languages, hindering progress outside the Anglo-centric paradigm. To address this gap, we propose TAPE (Text Attack and Perturbation Evaluation), a novel benchmark that includes six more complex NLU tasks for Russian, covering multi-hop reasoning, ethical concepts, logic, and commonsense knowledge. TAPE's design focuses on systematic zero-shot and few-shot NLU evaluation along two axes: (i) linguistically oriented adversarial attacks and perturbations for analyzing robustness, and (ii) subpopulations for nuanced interpretation. A detailed analysis of the autoregressive baselines indicates that simple spelling-based perturbations affect performance the most, while paraphrasing the input has a comparatively negligible effect. At the same time, the results demonstrate a significant gap between the neural and human baselines for most tasks. We publicly release TAPE (tape-benchmark.com) to foster research on robust LMs that can generalize to new tasks when little to no supervision is available.
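To make the idea of a spelling-based perturbation concrete, the following is a minimal sketch of how such a robustness check could be set up. It is not TAPE's implementation: the adjacent-letter swap is only one simple stand-in for a spelling perturbation, and the names `swap_adjacent_chars`, `accuracy`, and `robustness_gap`, as well as the `rate` and `seed` parameters, are hypothetical and chosen for illustration.

```python
import random


def swap_adjacent_chars(text: str, rate: float, seed: int = 0) -> str:
    """Toy spelling perturbation: swap two adjacent interior letters in some words.

    Hypothetical helper for illustration only, not part of TAPE.
    """
    rng = random.Random(seed)  # seeded so the perturbation is reproducible
    perturbed = []
    for word in text.split(" "):
        # Only touch words with at least two interior characters to swap.
        if len(word) > 3 and rng.random() < rate:
            i = rng.randrange(1, len(word) - 2)  # keeps first and last characters
            word = word[:i] + word[i + 1] + word[i] + word[i + 2:]
        perturbed.append(word)
    return " ".join(perturbed)


def accuracy(predict, examples):
    """Fraction of (text, label) pairs for which predict(text) equals label."""
    correct = sum(predict(text) == label for text, label in examples)
    return correct / len(examples)


def robustness_gap(predict, examples, rate=0.5, seed=0):
    """Accuracy on clean inputs minus accuracy on perturbed inputs."""
    noisy = [(swap_adjacent_chars(text, rate, seed), label) for text, label in examples]
    return accuracy(predict, examples) - accuracy(predict, noisy)
```

Here `predict` is assumed to be any callable that maps an input string to a label, for example a wrapper around a few-shot prompted language model. A larger value returned by `robustness_gap` would indicate that the model is more sensitive to this kind of noise.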