Blink: Lightweight Sample Runs for Cost Optimization of Big Data Applications

Paper title

Blink: Lightweight Sample Runs for Cost Optimization of Big Data Applications

Authors

Al-Sayeh, Hani; Jibril, Muhammad Attahir; Memishi, Bunjamin; Sattler, Kai-Uwe

Abstract

Distributed in-memory data processing engines accelerate iterative applications by caching substantial datasets in memory rather than recomputing them in each iteration. Selecting a suitable cluster size for caching these datasets plays an essential role in achieving optimal performance. In practice, this is a tedious and hard task for end users, who are typically not aware of cluster specifications, workload semantics and sizes of intermediate data. We present Blink, an autonomous sampling-based framework, which predicts sizes of cached datasets and selects optimal cluster size without relying on historical runs. We evaluate Blink on a variety of iterative, real-world, machine learning applications. With an average sample runs cost of 4.6% compared to the cost of optimal runs, Blink selects the optimal cluster size in 15 out of 16 cases, saving up to 47.4% of execution cost compared to average costs.

Paper title