Paper Title


LazyBatching: An SLA-aware Batching System for Cloud Machine Learning Inference

Authors

Yujeong Choi, Yunseong Kim, Minsoo Rhu

Abstract


In cloud ML inference systems, batching is an essential technique for increasing throughput, which helps optimize total cost of ownership. Prior graph batching combines individual DNN graphs into a single graph, allowing multiple inputs to be executed concurrently. We observe that coarse-grained graph batching becomes suboptimal at handling dynamic inference request traffic, leaving significant performance on the table. This paper proposes LazyBatching, an SLA-aware batching system that considers both scheduling and batching at the granularity of individual graph nodes, rather than the entire graph, enabling flexible batching. We show that LazyBatching can intelligently determine the set of nodes that can be efficiently batched together, achieving average improvements of 15x, 1.5x, and 5.5x over graph batching in average response time, throughput, and SLA satisfaction, respectively.
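The abstract contrasts whole-graph batching with batching at the granularity of individual graph nodes, where a request that arrives mid-execution can merge with an in-flight batch if every request still meets its SLA. The sketch below is a minimal, illustrative toy of that idea only, not the paper's actual algorithm or code: the node latencies, the two-request batch limit, and the slack check are all assumptions made up for this example.

```python
# Toy sketch of node-granularity, SLA-aware batching (illustrative assumptions only).
# At each node boundary, a pending request may join the in-flight batch if the
# estimated finish time of the remaining nodes keeps every request within its SLA.
# In a real system the late request would first execute the nodes it skipped on
# its own before merging; that step is omitted here for brevity.

from dataclasses import dataclass

@dataclass
class Request:
    rid: int
    arrival: float      # arrival time (ms)
    deadline: float     # absolute SLA deadline (ms)

# Assumed per-node latency (ms) for batch sizes 1 and 2.
NODE_LATENCY = [
    {1: 2.0, 2: 2.4},   # node 0
    {1: 3.0, 2: 3.5},   # node 1
    {1: 2.5, 2: 2.9},   # node 2
]

def remaining_cost(start_node: int, batch_size: int) -> float:
    """Latency of executing the rest of the graph at the given batch size."""
    return sum(NODE_LATENCY[n][batch_size] for n in range(start_node, len(NODE_LATENCY)))

def run(batch: list[Request], pending: list[Request]) -> None:
    now = max(r.arrival for r in batch)
    for node in range(len(NODE_LATENCY)):
        # Node-boundary decision: admit a pending request only if everyone
        # (current batch plus the newcomer) would still finish before its deadline.
        if pending and len(batch) < 2 and pending[0].arrival <= now:
            candidate = pending[0]
            finish = now + remaining_cost(node, len(batch) + 1)
            if all(finish <= r.deadline for r in batch + [candidate]):
                batch.append(pending.pop(0))
                print(f"t={now:.1f}ms: request {candidate.rid} joins at node {node}")
        now += NODE_LATENCY[node][len(batch)]
    for r in batch:
        verdict = "meets" if now <= r.deadline else "misses"
        print(f"request {r.rid} finishes at t={now:.1f}ms ({verdict} SLA)")

# Request 1 arrives while request 0 is mid-graph; node-level batching lets it
# merge at a node boundary instead of waiting for the entire graph to finish.
run(batch=[Request(0, arrival=0.0, deadline=20.0)],
    pending=[Request(1, arrival=1.0, deadline=20.0)])
```

The point the toy makes is that the join-or-wait decision happens at every node boundary, using the remaining-graph cost at the larger batch size checked against each request's deadline, rather than once per whole graph.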
