Paper Title
FedAdapter: Efficient Federated Learning for Modern NLP
Paper Authors
Paper Abstract
Transformer-based pre-trained models have revolutionized NLP with superior performance and generality. Fine-tuning pre-trained models for downstream tasks often requires private data, for which federated learning is the de-facto approach (i.e., FedNLP). However, our measurements show that FedNLP is prohibitively slow due to the large model sizes and the resulting high network/computation cost. Towards practical FedNLP, we identify adapters, small bottleneck modules inserted at various model layers, as the key building block. A key challenge is to properly configure the depth and width of adapters, to which the training speed and efficiency are highly sensitive. No silver-bullet configuration exists: the optimal choice varies across downstream NLP tasks, desired model accuracy, and mobile resources. To automate adapter configuration, we propose FedAdapter, a framework that enhances existing FedNLP with two novel designs. First, FedAdapter progressively upgrades the adapter configuration throughout a training session; the principle is to quickly learn shallow knowledge by training only a few small adapters at the model's top layers, and to incrementally learn deep knowledge by incorporating deeper and larger adapters. Second, FedAdapter continuously profiles future adapter configurations by allocating participant devices to trial groups. Extensive experiments show that FedAdapter can reduce FedNLP's model convergence delay to no more than several hours, which is up to 155.5$\times$ faster than vanilla FedNLP and 48$\times$ faster than strong baselines.
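To make the adapter terminology concrete, the sketch below shows a bottleneck adapter and what a (depth, width) configuration could look like, assuming a PyTorch transformer backbone; the class and function names, dimensions, and attachment scheme are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch: a bottleneck adapter and a (depth, width) configuration.
# Assumptions (not from the paper): a 12-layer encoder with hidden size 768;
# "width" = bottleneck dimension, "depth" = number of top layers given adapters.
import torch
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, hidden_dim: int, bottleneck_dim: int):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck_dim)  # project down (width)
        self.up = nn.Linear(bottleneck_dim, hidden_dim)    # project back up
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual connection keeps the frozen backbone's representation intact;
        # only the small down/up projections are trained and communicated.
        return x + self.up(self.act(self.down(x)))

def attach_adapters(num_layers: int, hidden_dim: int, depth: int, width: int):
    """Return adapters for the top `depth` layers of a `num_layers`-layer model."""
    return {
        layer: Adapter(hidden_dim, width)
        for layer in range(num_layers - depth, num_layers)
    }

# Example configuration: width-16 adapters on the top 4 of 12 layers.
adapters = attach_adapters(num_layers=12, hidden_dim=768, depth=4, width=16)
```

In this reading, "progressively upgrading the configuration" would mean moving from a small (depth, width) pair, such as (4, 16), toward deeper and wider settings as training proceeds, so early rounds stay cheap while later rounds can capture deeper knowledge.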