Paper Title
NURD: Negative-Unlabeled Learning for Online Datacenter Straggler Prediction
Paper Authors
Paper Abstract
Datacenters execute large computational jobs, which are composed of smaller tasks. A job completes only when all of its tasks finish, so stragglers (rare yet extremely slow tasks) are a major impediment to datacenter performance. Accurately predicting stragglers would enable proactive intervention, allowing datacenter operators to mitigate stragglers before they delay a job. While much prior work applies machine learning to predict computer system performance, these approaches rely either on complete labels (i.e., sufficient examples of all possible behaviors, both straggling and non-straggling) or on strong assumptions about the underlying latency distributions (e.g., whether they are Gaussian or not). Within a running job, however, none of this information is available until stragglers have revealed themselves, by which point they have already delayed the job. To predict stragglers accurately and early without labeled positive examples or assumptions about latency distributions, this paper presents NURD, a novel Negative-Unlabeled learning approach with Reweighting and Distribution-compensation that trains only on negative and unlabeled streaming data. The key idea is to train a predictor on finished, non-straggling tasks to predict latency for unlabeled running tasks, and then to reweight each unlabeled task's prediction based on a weighting function of its feature space. We evaluate NURD on two production traces from Google and Alibaba and find that, compared to the best baseline approach, NURD improves prediction accuracy by 2 to 11 percentage points in F1 score and improves job completion time by 2.0 to 8.8 percentage points.
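The key idea in the abstract (fit a latency predictor on finished tasks only, then reweight each running task's prediction by a function of its features) can be illustrated with a toy sketch. This is not the paper's algorithm. The single numeric feature, the least-squares line, the distance-based weight in `similarity_weight`, the `scale` parameter, and the fixed `threshold` are all assumptions made here for illustration, and every function name is hypothetical.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for latency = a * feature + b (one feature)."""
    mx, my = mean(xs), mean(ys)
    var = sum((x - mx) ** 2 for x in xs)
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / var
    return a, my - a * mx


def similarity_weight(x, finished_xs, scale):
    """Hypothetical weight in (0, 1]: equals 1 when x matches a finished
    task's feature, and shrinks as x moves away from all finished tasks."""
    nearest = min(abs(x - f) for f in finished_xs)
    return 1.0 / (1.0 + nearest / scale)


def predict_stragglers(finished, running, threshold, scale=1.0):
    """finished: list of (feature, latency) pairs for completed tasks, which
    serve as the only labeled (negative) data.
    running: dict mapping task id to the feature of an unlabeled task.
    Returns a dict mapping task id to True if flagged as a likely straggler."""
    xs = [x for x, _ in finished]
    ys = [y for _, y in finished]
    a, b = fit_line(xs, ys)
    flags = {}
    for task_id, x in running.items():
        raw = a * x + b                      # prediction trained on negatives only
        weight = similarity_weight(x, xs, scale)
        adjusted = raw / weight              # inflate predictions for unfamiliar tasks
        flags[task_id] = adjusted > threshold
    return flags


# Made-up data: three finished tasks and two running tasks.
finished = [(1.0, 10.0), (2.0, 12.0), (3.0, 14.0)]
running = {"t1": 2.5, "t2": 6.0}
print(predict_stragglers(finished, running, threshold=25.0))
```

In this made-up example the raw prediction for `t2` (20.0) stays below the threshold, because the predictor has only ever seen fast tasks. Dividing by the weight raises it above the threshold, which is the intuition behind compensating a model trained without positive examples.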