Paper Title

Network Aware Compute and Memory Allocation in Optically Composable Data Centres with Deep Reinforcement Learning and Graph Neural Networks

Authors

Zacharaya Shabka, Georgios Zervas

Abstract

Resource-disaggregated data centre architectures promise a means of pooling resources remotely within data centres, allowing for both the greater flexibility and the resource efficiency that underlie the increasingly important infrastructure-as-a-service business. This can be accomplished by using an optically circuit-switched backbone in the data centre network (DCN), which provides the bandwidth and latency guarantees required to ensure reliable performance when applications run across non-local resource pools. However, resource allocation in this scenario requires both server-level \emph{and} network-level resources to be co-allocated to requests. The online nature and underlying combinatorial complexity of this problem, alongside the typical scale of DCN topologies, make exact solutions impossible and heuristic-based solutions sub-optimal or non-intuitive to design. We demonstrate that \emph{deep reinforcement learning}, where the policy is modelled by a \emph{graph neural network}, can be used to learn effective \emph{network-aware} and \emph{topologically-scalable} allocation policies end-to-end. Compared to state-of-the-art heuristics for network-aware resource allocation, the method achieves up to $20\%$ higher acceptance ratio, can achieve the same acceptance ratio as the best-performing heuristic with $3\times$ fewer networking resources available, and can maintain all-around performance when directly applied (with no further training) to DCN topologies with $10^2\times$ more servers than the topologies seen during training.
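To make the co-allocation problem and the acceptance-ratio metric concrete, the following is a minimal sketch of a greedy first-fit baseline. It is not the paper's method (which learns the policy with reinforcement learning and a graph neural network), and it is not one of the heuristics the paper compares against. The names `Request`, `try_allocate`, and `acceptance_ratio`, the single-link network model, and the reading of acceptance ratio as "accepted requests divided by total requests" are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Request:
    cpu: int        # compute units requested (hypothetical unit)
    mem: int        # memory units requested (hypothetical unit)
    bandwidth: int  # bandwidth needed if compute and memory sit on different servers


def try_allocate(req, servers, link_capacity):
    """Greedy first-fit co-allocation (illustrative assumption, not the paper's policy).

    servers: list of dicts {"cpu": int, "mem": int} holding free resources.
    link_capacity: dict mapping frozenset({i, j}) to free bandwidth between servers i and j.
    Commits the resources and returns True if the request fits, otherwise returns False.
    """
    for i, compute_host in enumerate(servers):
        if compute_host["cpu"] < req.cpu:
            continue
        for j, memory_host in enumerate(servers):
            if memory_host["mem"] < req.mem:
                continue
            if i != j:
                # Remote memory: the network must also carry the request.
                link = frozenset({i, j})
                if link_capacity.get(link, 0) < req.bandwidth:
                    continue
                link_capacity[link] -= req.bandwidth
            compute_host["cpu"] -= req.cpu
            memory_host["mem"] -= req.mem
            return True
    return False


def acceptance_ratio(requests, servers, link_capacity):
    """Fraction of requests accepted when served one at a time, in arrival order."""
    if not requests:
        return 0.0
    accepted = sum(try_allocate(r, servers, link_capacity) for r in requests)
    return accepted / len(requests)


if __name__ == "__main__":
    # One compute-only server and one memory-only server joined by a thin link.
    servers = [{"cpu": 4, "mem": 0}, {"cpu": 0, "mem": 4}]
    links = {frozenset({0, 1}): 1}
    demo = [Request(cpu=1, mem=1, bandwidth=1), Request(cpu=1, mem=1, bandwidth=1)]
    print(acceptance_ratio(demo, servers, links))
```

In the demo, the second request is rejected even though both servers still have free CPU and memory, because the link between them is exhausted. This is the sense in which server-level and network-level resources have to be considered together.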
