Paper Title

Multi-Job Intelligent Scheduling with Cross-Device Federated Learning

Paper Authors

Ji Liu, Juncheng Jia, Beichen Ma, Chendi Zhou, Jingbo Zhou, Yang Zhou, Huaiyu Dai, Dejing Dou

Paper Abstract

Recent years have witnessed a large amount of decentralized data in various (edge) devices of end-users, while the decentralized data aggregation remains complicated for machine learning jobs because of regulations and laws. As a practical approach to handling decentralized data, Federated Learning (FL) enables collaborative global machine learning model training without sharing sensitive raw data. The servers schedule devices to jobs within the training process of FL. In contrast, device scheduling with multiple jobs in FL remains a critical and open problem. In this paper, we propose a novel multi-job FL framework, which enables the training process of multiple jobs in parallel. The multi-job FL framework is composed of a system model and a scheduling method. The system model enables a parallel training process of multiple jobs, with a cost model based on the data fairness and the training time of diverse devices during the parallel training process. We propose a novel intelligent scheduling approach based on multiple scheduling methods, including an original reinforcement learning-based scheduling method and an original Bayesian optimization-based scheduling method, which corresponds to a small cost while scheduling devices to multiple jobs. We conduct extensive experimentation with diverse jobs and datasets. The experimental results reveal that our proposed approaches significantly outperform baseline approaches in terms of training time (up to 12.73 times faster) and accuracy (up to 46.4% higher).
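The abstract describes a cost model that balances the training time of heterogeneous devices against data fairness when scheduling devices to multiple parallel FL jobs. The following is a minimal illustrative sketch of that idea as a greedy scheduler; the cost form alpha * device_time + beta * fairness_penalty, the functions schedule_round and fairness_penalty, and all parameter values are assumptions made for illustration, not the paper's reinforcement-learning- or Bayesian-optimization-based methods.

```python
# Hypothetical sketch (not the paper's implementation): greedy device-to-job
# scheduling that trades off per-round training time against data fairness.
import random


def fairness_penalty(selection_counts):
    """Variance of per-device selection counts; lower means fairer device usage."""
    mean = sum(selection_counts.values()) / len(selection_counts)
    return sum((c - mean) ** 2 for c in selection_counts.values()) / len(selection_counts)


def schedule_round(jobs, device_times, selection_counts, per_job=2, alpha=1.0, beta=0.5):
    """Assign `per_job` devices to each job for one round, greedily minimizing
    alpha * (device training time) + beta * (fairness penalty after selection)."""
    assignment = {}
    busy = set()  # a device serves at most one job per round in this toy model
    for job in jobs:
        candidates = [d for d in device_times if d not in busy]

        def cost(device):
            counts = dict(selection_counts)
            counts[device] += 1
            return alpha * device_times[device] + beta * fairness_penalty(counts)

        chosen = sorted(candidates, key=cost)[:per_job]
        for d in chosen:
            busy.add(d)
            selection_counts[d] += 1
        assignment[job] = chosen
    return assignment


if __name__ == "__main__":
    random.seed(0)
    # Simulated per-round training time (seconds) for each device.
    devices = {f"dev{i}": random.uniform(1.0, 10.0) for i in range(8)}
    counts = {d: 0 for d in devices}
    for rnd in range(3):
        plan = schedule_round(["job_A", "job_B"], devices, counts)
        print(f"round {rnd}: {plan}")
```

The fairness term penalizes repeatedly selecting the same fast devices, so over several rounds the toy scheduler spreads participation across devices rather than always choosing the fastest ones, mirroring the trade-off between training time and data fairness described in the abstract.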
