Paper Title
CAB: Comprehensive Attention Benchmarking on Long Sequence Modeling
Paper Authors
Paper Abstract
Transformer has achieved remarkable success in language, image, and speech processing. Recently, various efficient attention architectures have been proposed to improve transformer's efficiency while largely preserving its efficacy, especially in modeling long sequences. A widely-used benchmark to test these efficient methods' capability on long-range modeling is Long Range Arena (LRA). However, LRA only focuses on the standard bidirectional (or noncausal) self attention, and completely ignores cross attentions and unidirectional (or causal) attentions, which are equally important to downstream applications. In this paper, we propose Comprehensive Attention Benchmark (CAB) under a fine-grained attention taxonomy with four distinguishable attention patterns, namely, noncausal self, causal self, noncausal cross, and causal cross attentions. CAB collects seven real-world tasks from different research areas to evaluate efficient attentions under the four attention patterns. Among these tasks, CAB validates efficient attentions in eight backbone networks to show their generalization across neural architectures. We conduct exhaustive experiments to benchmark the performances of nine widely-used efficient attention architectures designed with different philosophies on CAB. Extensive experimental results also shed light on the fundamental problems of efficient attentions, such as efficiency length against vanilla attention, performance consistency across attention patterns, the benefit of attention mechanisms, and interpolation/extrapolation on long-context language modeling.
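To make the four-pattern taxonomy concrete, here is a minimal NumPy sketch (not the paper's code; the function name attend and all tensor shapes are illustrative assumptions). It separates the patterns along two axes: whether queries and keys/values come from the same sequence (self vs. cross) and whether future positions are masked out (causal vs. noncausal).

import numpy as np

def attend(q, k, v, causal=False):
    # Scaled dot-product attention; q: (Tq, d), k and v: (Tk, d).
    scores = q @ k.T / np.sqrt(q.shape[-1])                  # (Tq, Tk)
    if causal:
        # Forbid each query position i from attending to key positions j > i.
        future = np.triu(np.ones(scores.shape, dtype=bool), 1)
        scores = np.where(future, -np.inf, scores)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v                                       # (Tq, d)

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 16))     # target-side sequence
m = rng.standard_normal((12, 16))    # source-side (memory) sequence

noncausal_self  = attend(x, x, x)                # e.g. encoder self attention
causal_self     = attend(x, x, x, causal=True)   # e.g. decoder language modeling
noncausal_cross = attend(x, m, m)                # e.g. encoder-decoder cross attention
causal_cross    = attend(x, m, m, causal=True)   # e.g. streaming decoding over a growing source

An efficient attention architecture replaces the quadratic score computation above with a cheaper approximation; CAB evaluates whether such a replacement remains effective under all four query/key configurations rather than only noncausal self attention.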