Paper Title

The Life and Death of SSDs and HDDs: Similarities, Differences, and Prediction Models

Authors

Riccardo Pinciroli, Lishan Yang, Jacob Alter, Evgenia Smirni

Abstract

Data center downtime typically centers around IT equipment failure. Storage devices are the most frequently failing components in data centers. We present a comparative study of hard disk drives (HDDs) and solid state drives (SSDs) that constitute the typical storage in data centers. Using six years of field data on 100,000 HDDs of different models from the same manufacturer, taken from the Backblaze dataset, and six years of field data on 30,000 SSDs of three models from a Google data center, we characterize the workload conditions that lead to failures and illustrate that their root causes differ from common expectations but remain difficult to discern. For HDDs, we observe that young and old drives do not present many differences in their failures. Instead, failures may be distinguished by discriminating drives based on the time spent on head positioning. For SSDs, we observe high levels of infant mortality and characterize the differences between infant and non-infant failures. We develop several machine learning failure prediction models that are shown to be surprisingly accurate, achieving high recall and low false positive rates. These models are used beyond simple prediction, as they help us untangle the complex interaction of workload characteristics that lead to failures and identify failure root causes from monitored symptoms.
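
The abstract reports prediction quality in terms of recall and false positive rate. As a reminder of what those two metrics measure, here is a minimal Python sketch. It is not the authors' evaluation code; the function name `recall_and_fpr` and the label vectors are made up for illustration, with 1 meaning "drive failed" and 0 meaning "drive healthy".

```python
def recall_and_fpr(y_true, y_pred):
    """Return (recall, false positive rate) for binary labels (1 = failed drive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # failures correctly flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # failures missed
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # healthy drives flagged
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # healthy drives left alone
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return recall, fpr


# Hypothetical labels, not data from the paper.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 1, 0]
recall, fpr = recall_and_fpr(y_true, y_pred)
print(f"recall={recall:.3f}, false positive rate={fpr:.3f}")
```

High recall means few failing drives go undetected; a low false positive rate means few healthy drives are flagged. The latter matters in practice because healthy drives vastly outnumber failing ones in a data center.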
