First CE Matters: On the Importance of Long Term Properties on Memory Failure Prediction

Paper title

First CE Matters: On the Importance of Long Term Properties on Memory Failure Prediction

Authors

Bogatinovski, Jasmin; Yu, Qiao; Cardoso, Jorge; Kao, Odej

Abstract

Dynamic random access memory (DRAM) failures threaten the reliability of data centres because they lead to data loss and system crashes. Timely prediction of memory failures allows preventive measures such as server migration and memory replacement. Memory failure prediction thereby keeps failures from externalizing, which makes it a vital task for improving system reliability. In this paper, we revisit the problem of memory failure prediction. We analyze the correctable errors (CEs) from hardware logs as indicators of a degraded memory state. Because memories do not always operate at full occupancy, accesses to faulty memory parts are distributed over time. Following this intuition, we observe that properties important for memory failure prediction are spread over long time intervals. In contrast, related studies, to fit practical constraints, frequently analyze only the CEs from the last fixed-size time interval and ignore the information that predates it. Motivated by this discrepancy, we study the impact of including the overall (long-range) CE evolution and propose novel features that are calculated incrementally to preserve long-range properties. By coupling the extracted features with machine learning methods, we learn a predictive model that anticipates upcoming failures three hours in advance while improving the average relative precision and recall by 21% and 19%, respectively. We evaluate our methodology on real-world memory failures from the server fleet of a large cloud provider, which supports its validity and practicality.
