Paper Title
CMMD: Cross-Metric Multi-Dimensional Root Cause Analysis
Paper Authors
Paper Abstract
In large-scale online services, crucial metrics, also known as key performance indicators (KPIs), are monitored periodically to check the services' running status. Generally, KPIs are aggregated along multiple dimensions and derived through complex calculations over fundamental metrics from the raw data. Once abnormal KPI values are observed, root cause analysis (RCA) can be applied to identify the reasons for the anomalies so that we can troubleshoot quickly. Recently, several automatic RCA techniques have been proposed to localize the dimensions (or combinations of dimensions) that explain the anomalies. However, their analyses are limited to the data of the abnormal metric and ignore the data of other metrics that may also be related to the anomalies, leading to imprecise or even incorrect root causes. To this end, we propose a cross-metric multi-dimensional root cause analysis method, named CMMD, which consists of two key components: 1) relationship modeling, which uses a graph neural network (GNN) to model the unknown complex calculations among metrics and the aggregation functions among dimensions from historical data; and 2) root cause localization, which adopts a genetic algorithm to efficiently and effectively dive into the raw data and localize the abnormal dimension(s) once KPI anomalies are detected. Experiments on synthetic datasets, public datasets, and an online production environment demonstrate the superiority of the proposed CMMD method over the baselines. Currently, CMMD is running as an online service in Microsoft Azure.
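To make the root cause localization step more concrete, the following is a minimal sketch of a genetic search over dimension-value combinations. It is not the CMMD implementation. The toy data, the two dimensions (region and device), the fitness function, and the names `LEAVES`, `fitness`, `mutate`, `crossover`, and `localize` are all assumptions made for illustration. The forecast values stand in for what a relationship model (a GNN in the abstract) would predict; here they are simply hard-coded.

```python
import random

# Toy leaf-level data: (region, device) -> (forecast, actual).
# All values are invented; the forecasts stand in for a learned model's output.
LEAVES = {
    ("EU", "mobile"): (100.0, 40.0),   # injected anomaly
    ("EU", "desktop"): (100.0, 98.0),
    ("US", "mobile"): (100.0, 101.0),
    ("US", "desktop"): (100.0, 100.0),
    ("ASIA", "mobile"): (100.0, 99.0),
    ("ASIA", "desktop"): (100.0, 102.0),
}
DIM_VALUES = [["EU", "US", "ASIA"], ["mobile", "desktop"]]
WILDCARD = None  # means "any value" for that dimension


def matches(candidate, leaf):
    """True if the leaf falls inside the slice described by the candidate."""
    return all(c is WILDCARD or c == v for c, v in zip(candidate, leaf))


def fitness(candidate):
    """Hypothetical score: share of total deviation covered by the slice,
    multiplied by the relative deviation inside the slice."""
    matched = [fa for leaf, fa in LEAVES.items() if matches(candidate, leaf)]
    total_dev = sum(abs(a - f) for f, a in LEAVES.values())
    dev = sum(abs(a - f) for f, a in matched)
    forecast = sum(f for f, _ in matched)
    if not matched or total_dev == 0 or forecast == 0:
        return 0.0
    return (dev / total_dev) * (dev / forecast)


def mutate(candidate, rng):
    """Resample one randomly chosen dimension (possibly to the wildcard)."""
    i = rng.randrange(len(candidate))
    new = list(candidate)
    new[i] = rng.choice(DIM_VALUES[i] + [WILDCARD])
    return tuple(new)


def crossover(a, b, rng):
    """Pick each dimension's value from one of the two parents."""
    return tuple(rng.choice(pair) for pair in zip(a, b))


def localize(generations=30, pop_size=8, seed=0):
    """Evolve candidates with elitism and return the best one found."""
    rng = random.Random(seed)
    start = tuple(WILDCARD for _ in DIM_VALUES)
    population = [start] + [mutate(start, rng) for _ in range(pop_size - 1)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        parents = population[: pop_size // 2]  # keep the fitter half unchanged
        children = [
            mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rng)
            for _ in range(pop_size - len(parents))
        ]
        population = parents + children
    return max(population, key=fitness)


if __name__ == "__main__":
    best = localize()
    print(best, round(fitness(best), 3))
```

Under this assumed fitness, the injected slice `("EU", "mobile")` scores higher than its coarser parents `("EU", None)` and `(None, None)`, so the search is rewarded for narrowing down to the abnormal combination rather than stopping at a broad slice. Because the fitter half of the population is carried over unchanged, the best score never decreases between generations.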