Paper Title
A Decentralised Multi-Agent Reinforcement Learning Approach for the Same-Day Delivery Problem
Paper Authors
Paper Abstract
Same-Day Delivery services have become increasingly popular in recent years. Previous studies have usually modelled these services as a class of Dynamic Vehicle Routing Problem (DVRP) in which goods must be delivered from a depot to a set of customers on the same day the orders are placed. Adaptive exact solution methods for DVRPs can become intractable even for small problem instances. In this paper, we formulate the Same-Day Delivery Problem (SDDP) as a Markov Decision Process (MDP) and solve it using a parameter-sharing Deep Q-Network, which corresponds to a decentralised Multi-Agent Reinforcement Learning (MARL) approach. For this, we create a multi-agent grid-based SDD environment consisting of multiple vehicles, a central depot and dynamic order generation. In addition, we introduce zone-specific order generation and reward probabilities. We compare the performance of our proposed MARL approach against a Mixed Integer Programming (MIP) solution. Results show that our proposed MARL framework performs on par with the MIP-based policy when the number of orders is relatively low. For problem instances with higher order arrival rates, computational results show that the MARL approach underperforms the MIP-based policy by up to 30%. The performance gap between the two methods becomes smaller when zone-specific parameters are employed; the gap is reduced from 30% to 3% for a 5x5 grid scenario with 30 orders. Execution time results indicate that the MARL approach is, on average, 65 times faster than the MIP-based policy, and therefore may be more advantageous for real-time control, at least for small-sized instances.
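To make the parameter-sharing idea concrete, below is a minimal Python/PyTorch sketch of decentralised action selection with a single shared Q-network. This is not the authors' implementation: the number of vehicles, action set, observation encoding and network architecture are illustrative assumptions chosen only to show that every agent queries the same set of weights.

import random
import torch
import torch.nn as nn

N_AGENTS = 3        # assumed number of vehicles
N_ACTIONS = 5       # assumed action set: up, down, left, right, stay/serve
OBS_DIM = 28        # assumed local observation: own (x, y) + 5x5 open-order map + remaining time

class SharedQNet(nn.Module):
    """One Q-network whose parameters are shared by all vehicle agents."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(OBS_DIM, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, N_ACTIONS),
        )

    def forward(self, x):
        return self.net(x)

q_net = SharedQNet()
optimiser = torch.optim.Adam(q_net.parameters(), lr=1e-3)  # one optimiser, one set of weights for all agents

def select_actions(observations, epsilon=0.1):
    """Decentralised execution: each agent feeds its own observation through the shared net."""
    actions = []
    for obs in observations:                 # one local observation per vehicle
        if random.random() < epsilon:        # epsilon-greedy exploration
            actions.append(random.randrange(N_ACTIONS))
        else:
            with torch.no_grad():
                q = q_net(obs)               # same shared weights for every agent
            actions.append(int(q.argmax().item()))
    return actions

# Dummy per-agent observations stand in for the grid-based SDD environment state.
dummy_obs = [torch.zeros(OBS_DIM) for _ in range(N_AGENTS)]
print(select_actions(dummy_obs))

In a standard parameter-sharing DQN setup, transitions collected by all vehicles would be pooled into a single replay buffer and used to update these shared weights, which is what keeps the number of trainable parameters constant as the number of agents grows.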