Paper Title

FIRED: a fine-grained robust performance diagnosis framework for cloud applications

Authors

Ruyue Xin, Hongyun Liu, Peng Chen, Paola Grosso, Zhiming Zhao

Abstract

To run a cloud application with the required service quality, operators have to continuously monitor the application's run-time status, detect potential performance anomalies, and diagnose their root causes. However, existing performance anomaly detection models often suffer from low reusability and robustness, owing to the diversity of the system-level metrics being monitored and the lack of high-quality labeled monitoring data for anomalies. Moreover, current coarse-grained analysis models make it difficult to locate the system-level root causes of application performance anomalies, which hinders effective adaptation decisions. We provide a FIne-grained Robust pErformance Diagnosis (FIRED) framework to tackle these challenges. The framework offers an ensemble of several well-selected base models for anomaly detection using a deep neural network, which adopts weakly-supervised learning because few labels exist in practice. The framework also employs a real-time fine-grained analysis model to locate the system metrics on which an anomaly depends. Our experiments show that the framework achieves the best detection accuracy and algorithm robustness, and that it can predict anomalies in four minutes with an F1 score higher than 0.8. In addition, the framework accurately localizes the first root cause and achieves an average accuracy higher than 0.7 in locating the first four root causes.
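The two evaluation figures quoted in the abstract (an F1 score above 0.8 and an accuracy above 0.7 for the first four root causes) can be illustrated with a minimal sketch. It uses the standard definition of F1 and a common top-k hit-rate reading of localization accuracy; the paper's exact metric definitions may differ, and the function names, metric names, and sample data below are hypothetical.

```python
def f1_score(y_true, y_pred):
    """Standard F1 for binary labels (1 = anomaly, 0 = normal)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def top_k_accuracy(ranked_causes, true_causes, k):
    """Fraction of anomalies whose true root cause is among the top-k ranked metrics.

    Assumption: one true root-cause metric per anomaly; the paper may
    average over several causes per anomaly instead.
    """
    hits = sum(
        1 for ranked, truth in zip(ranked_causes, true_causes)
        if truth in ranked[:k]
    )
    return hits / len(true_causes)


# Hypothetical data, for illustration only.
labels = [1, 1, 0, 0, 1]
predictions = [1, 0, 0, 1, 1]
rankings = [["cpu", "mem", "disk"], ["net", "cpu", "mem"]]
truths = ["mem", "disk"]

print(f1_score(labels, predictions))
print(top_k_accuracy(rankings, truths, k=2))
```

Under this reading, the abstract's localization figure is the mean of such hit rates as k ranges over the first four ranked candidates.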
