Paper Title

A Machine Learning Approach to Online Fault Classification in HPC Systems

Authors

Alessio Netti, Zeynep Kiziltan, Ozalp Babaoglu, Alina Sirbu, Andrea Bartolini, Andrea Borghesi

Abstract

As High-Performance Computing (HPC) systems strive towards the exascale goal, failure rates both at the hardware and software levels will increase significantly. Thus, detecting and classifying faults in HPC systems as they occur and initiating corrective actions before they can transform into failures becomes essential for continued operation. Central to this objective is fault injection, which is the deliberate triggering of faults in a system so as to observe their behavior in a controlled environment. In this paper, we propose a fault classification method for HPC systems based on machine learning. The novelty of our approach rests with the fact that it can be operated on streamed data in an online manner, thus opening the possibility to devise and enact control actions on the target system in real-time. We introduce a high-level, easy-to-use fault injection tool called FINJ, with a focus on the management of complex experiments. In order to train and evaluate our machine learning classifiers, we inject faults to an in-house experimental HPC system using FINJ, and generate a fault dataset which we describe extensively. Both FINJ and the dataset are publicly available to facilitate resiliency research in the HPC systems field. Experimental results demonstrate that our approach allows almost perfect classification accuracy to be reached for different fault types with low computational overhead and minimal delay.
