Paper Title
Hercules: Heterogeneity-Aware Inference Serving for At-Scale Personalized Recommendation
Paper Authors
Paper Abstract
Personalized recommendation is an important class of deep-learning applications that powers a large collection of internet services and consumes a considerable amount of datacenter resources. As the scale of production-grade recommendation systems continues to grow, optimizing their serving performance and efficiency in a heterogeneous datacenter is important and can translate into infrastructure capacity savings. In this paper, we propose Hercules, an optimized framework for personalized recommendation inference serving that targets diverse industry-representative models and cloud-scale heterogeneous systems. Hercules performs a two-stage optimization procedure: offline profiling and online serving. The first stage searches the large, under-explored task scheduling space with a gradient-based search algorithm, achieving up to 9.0x latency-bounded throughput improvement on individual servers; it also identifies the optimal heterogeneous server architecture for each recommendation workload. The second stage performs heterogeneity-aware cluster provisioning to optimize resource mapping and allocation in response to fluctuating diurnal loads. The proposed cluster scheduler in Hercules achieves 47.7% cluster capacity savings and reduces the provisioned power by 23.7% over a state-of-the-art greedy scheduler.
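To make the cluster-provisioning problem in the second stage concrete, the following is a minimal sketch of a simple efficiency-ranked greedy heuristic of the kind the abstract uses as a baseline. It is not the Hercules scheduler, and the abstract does not specify either algorithm in detail. The `ServerType` fields, the `provision` function, the two server types, and every number below are hypothetical and chosen only for illustration.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ServerType:
    """One server class in a heterogeneous fleet (all values hypothetical)."""
    name: str
    qps: float      # latency-bounded throughput per server, e.g. from offline profiling
    watts: float    # provisioned power per server
    available: int  # number of servers of this type in the cluster


def provision(load_qps, fleet):
    """Greedy baseline: fill the load with the most power-efficient type first.

    Returns ({server name: count}, total provisioned watts).
    Raises ValueError if the fleet cannot cover the load.
    """
    plan = {}
    remaining = load_qps
    total_watts = 0.0
    # Rank server types by throughput per watt, best first.
    for server in sorted(fleet, key=lambda s: s.qps / s.watts, reverse=True):
        if remaining <= 0:
            break
        count = min(server.available, math.ceil(remaining / server.qps))
        if count > 0:
            plan[server.name] = count
            remaining -= count * server.qps
            total_watts += count * server.watts
    if remaining > 0:
        raise ValueError("fleet cannot serve the requested load")
    return plan, total_watts


# Hypothetical fleet and a coarse diurnal load trace (queries per second).
FLEET = [
    ServerType("cpu", qps=1000.0, watts=300.0, available=10),
    ServerType("gpu", qps=5000.0, watts=700.0, available=2),
]
DIURNAL_LOAD = [2000.0, 9000.0, 14000.0, 6000.0]

for load in DIURNAL_LOAD:
    plan, watts = provision(load, FLEET)
    print(f"load={load:>8.0f} qps  plan={plan}  power={watts:.0f} W")
```

Re-running `provision` for each point of the load trace shows how the allocation grows and shrinks with diurnal demand. A greedy rule like this one can over-provision at low load (a single efficient but large server may far exceed demand), which is one reason a scheduler that considers the workload-to-server mapping jointly can save capacity and power.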