Paper Title
AutoTSG: Learning and Synthesis for Incident Troubleshooting
Authors
Abstract
Incident management is a key aspect of operating large-scale cloud services. To aid faster and more efficient resolution of incidents, engineering teams document frequent troubleshooting steps in the form of Troubleshooting Guides (TSGs), to be used by on-call engineers (OCEs). However, TSGs are siloed, unstructured, and often incomplete, requiring developers to manually understand and execute the necessary steps. This results in a plethora of issues such as on-call fatigue, reduced productivity, and human error. In this work, we conduct a large-scale empirical study of over 4K TSGs mapped to 1000s of incidents and find that TSGs are widely used and help significantly reduce mitigation efforts. We then analyze feedback on TSGs provided by 400+ OCEs and propose a taxonomy of issues that highlights significant gaps in TSG quality. To alleviate these gaps, we investigate the automation of TSGs and propose AutoTSG -- a novel framework for automating TSGs into executable workflows by combining machine learning and program synthesis. Our evaluation of AutoTSG on 50 TSGs shows its effectiveness in both identifying TSG statements (accuracy 0.89) and parsing them for execution (precision 0.94 and recall 0.91). Lastly, we survey ten Microsoft engineers and show the importance of TSG automation and the usefulness of AutoTSG.