Hercules: Heterogeneity-Aware Inference Serving for At-Scale Personalized Recommendation

Paper Title

Hercules: Heterogeneity-Aware Inference Serving for At-Scale Personalized Recommendation

Authors

Liu Ke, Udit Gupta, Mark Hempstead, Carole-Jean Wu, Hsien-Hsin S. Lee, Xuan Zhang

Abstract

Personalized recommendation is an important class of deep-learning applications that powers a large collection of internet services and consumes a considerable amount of datacenter resources. As the scale of production-grade recommendation systems continues to grow, optimizing their serving performance and efficiency in a heterogeneous datacenter is important and can translate into infrastructure capacity savings. In this paper, we propose Hercules, an optimized framework for personalized recommendation inference serving that targets diverse industry-representative models and cloud-scale heterogeneous systems. Hercules performs a two-stage optimization procedure: offline profiling and online serving. The first stage searches the large, under-explored task scheduling space with a gradient-based search algorithm, achieving up to 9.0x latency-bounded throughput improvement on individual servers; it also identifies the optimal heterogeneous server architecture for each recommendation workload. The second stage performs heterogeneity-aware cluster provisioning to optimize resource mapping and allocation in response to fluctuating diurnal loads. The proposed cluster scheduler in Hercules achieves 47.7% cluster capacity savings and reduces the provisioned power by 23.7% over a state-of-the-art greedy scheduler.
