Paper Title
Straggler-Robust Distributed Optimization with the Parameter Server Utilizing Coded Gradient
Paper Authors
Paper Abstract
Optimization over distributed networks plays a central role in almost all distributed machine learning problems. In principle, distributing tasks across nodes reduces computation time, allowing better response rates and higher data reliability. However, for these algorithms to run effectively in complex distributed systems, they must compensate for communication asynchrony and for network node failures and delays, known as stragglers. These issues can change the effective connection topology of the network, which may vary over time, and thereby hinder the optimization process. In this paper, we propose a new distributed unconstrained optimization algorithm for minimizing a strongly convex function, adapted to a parameter server network. In particular, the worker nodes solve their local optimization problems, compute their local coded gradients, and send them to different server nodes. Each server node then aggregates the local gradients it receives, allowing convergence to the desired optimizer. The algorithm is robust to worker node failures, disconnections, and straggler delays. One way to overcome the straggler problem is to allow coding over the network. We further extend this coding framework to enhance the convergence of the proposed algorithm under such time-varying network topologies. Finally, we implement the proposed scheme in MATLAB and provide comparative results demonstrating the effectiveness of the proposed framework.
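To make the idea of coded gradients concrete, the following is a minimal sketch in Python, not the algorithm proposed in the paper (the abstract does not specify the paper's coding scheme, and the authors' implementation is in MATLAB). It assumes a well-known three-worker gradient-coding construction that tolerates one straggler per iteration, a single server instead of several, and a scalar least-squares objective as the strongly convex function. The names `B`, `DECODE`, `PARTS`, `coded_gradient`, `server_decode`, and `run` are all hypothetical.

```python
import random

# Assumed encoding matrix: row w holds the weights worker w applies to the
# gradients of data partitions 0, 1, 2. Each worker touches two partitions.
B = [[0.5, 1.0, 0.0],
     [0.0, 1.0, -1.0],
     [0.5, 0.0, 1.0]]

# Decoding weights for each pair of non-straggling workers; each combination
# of the two coded messages reproduces g0 + g1 + g2 exactly.
DECODE = {(0, 1): (2.0, -1.0), (0, 2): (1.0, 1.0), (1, 2): (1.0, 2.0)}

# Toy data: each partition is a list of (a, b) pairs for the objective
# f(x) = sum over all pairs of 0.5 * (a * x - b)**2, strongly convex in x.
PARTS = [[(1.0, 2.0), (2.0, 3.9)], [(1.5, 3.1)], [(0.5, 0.9), (1.0, 2.1)]]


def partition_gradient(part, x):
    """Gradient of one partition's share of the objective at x."""
    return sum(a * (a * x - b) for a, b in part)


def coded_gradient(worker, parts, x):
    """Coded message a worker sends: a fixed linear mix of partition gradients."""
    return sum(w * partition_gradient(p, x)
               for w, p in zip(B[worker], parts) if w != 0.0)


def server_decode(received):
    """Recover the full gradient from the two messages that arrived."""
    workers = tuple(sorted(received))
    return sum(c * received[w] for c, w in zip(DECODE[workers], workers))


def run(parts, steps=200, lr=0.05, seed=0):
    """Gradient descent in which one randomly chosen worker straggles per step."""
    rng = random.Random(seed)
    x = 0.0
    for _ in range(steps):
        straggler = rng.randrange(3)  # this worker's message is never received
        received = {w: coded_gradient(w, parts, x)
                    for w in range(3) if w != straggler}
        x -= lr * server_decode(received)
    return x


if __name__ == "__main__":
    print(run(PARTS))
```

Because any two coded messages suffice to recover the exact full gradient, the iterates in this sketch are identical to those of ordinary gradient descent regardless of which worker straggles. The price is redundancy: each worker computes gradients on two partitions rather than one.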