Backblaze Inc.

06/08/2024 | Press release | Distributed by Public on 06/08/2024 20:20

Backblaze Drive Stats for Q2 2024

As of the end of Q2 2024, Backblaze was monitoring 288,665 hard drives (HDDs) and solid state drives (SSDs) in our cloud storage servers located in our data centers around the world. We removed from this analysis 3,789 boot drives, consisting of 2,923 SSDs and 866 hard drives. This leaves us with 284,876 hard drives under management to review for this report. We'll review the annualized failure rates (AFRs) for Q2 2024 and the lifetime AFRs of the qualifying drive models, and we'll also check out drive age versus failure rates over time. Along the way, we'll share our observations and insights on the data presented and, as always, we look forward to you doing the same in the comments section at the end of the post.

Hard drive failure rates for Q2 2024

For our Q2 2024 quarterly analysis, we remove from consideration: drive models which did not have at least 100 drives in service at the end of the quarter, drive models which did not accumulate 10,000 or more drive days during the quarter, and individual drives which exceeded their manufacturer's temperature specification during their lifetime. The removed pool totalled 490 drives, leaving us with 284,386 drives grouped into 29 drive models for our Q2 2024 analysis.
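
The model-level selection thresholds described above can be sketched in a few lines of Python. This is a minimal illustration, not Backblaze's actual pipeline: the `ModelStats` record, its field names, and the sample rows are hypothetical, and the per-drive temperature exclusion is left out because it would need each drive's history.

```python
from dataclasses import dataclass


@dataclass
class ModelStats:
    # Hypothetical per-model summary for one quarter; field names are assumptions.
    model: str
    drive_count: int  # drives in service at the end of the quarter
    drive_days: int   # drive days accumulated during the quarter


def qualifies_for_quarter(stats, min_drives=100, min_drive_days=10_000):
    """Return True if a drive model meets both quarterly thresholds from the text."""
    return stats.drive_count >= min_drives and stats.drive_days >= min_drive_days


# Made-up rows for illustration only; these are not Drive Stats figures.
sample = [
    ModelStats("MODEL_A", 1_200, 108_000),
    ModelStats("MODEL_B", 60, 5_400),   # too few drives
    ModelStats("MODEL_C", 150, 9_000),  # too few drive days
]
kept = [s.model for s in sample if qualifies_for_quarter(s)]
print(kept)

# The drive counts quoted in the report are consistent with each other.
assert 288_665 - 3_789 == 284_876  # all drives minus boot drives
assert 284_876 - 490 == 284_386    # minus the removed pool
```

The same function could be reused for the lifetime review by passing different thresholds.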

The table below lists the AFRs and related data for these drive models. The table is sorted large to small by drive size then by AFR within drive size.
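
This report does not restate how AFR is computed. Assuming the definition used throughout the Drive Stats series (failures divided by drive years, where drive years are drive days divided by 365, expressed as a percentage), a minimal sketch of the calculation and of the table's sort order might look like this. The function name and sample rows are hypothetical, and the sort direction for AFR within a drive size is not stated in the text, so descending is an assumption.

```python
def annualized_failure_rate(failures, drive_days):
    """AFR in percent: failures per drive year of operation (assumed definition)."""
    if drive_days <= 0:
        raise ValueError("drive_days must be positive")
    return failures / (drive_days / 365) * 100


# Made-up rows: (model, size_tb, failures, drive_days). Not Drive Stats data.
rows = [
    ("MODEL_B", 16, 2, 146_000),
    ("MODEL_C", 8, 5, 91_250),
    ("MODEL_A", 16, 10, 365_000),
]

# Sort large to small by drive size, then (assumed) large to small by AFR.
table = sorted(
    ((m, tb, annualized_failure_rate(f, d)) for m, tb, f, d in rows),
    key=lambda r: (-r[1], -r[2]),
)
for model, size_tb, afr in table:
    print(f"{model} {size_tb}TB {afr:.2f}%")
```

Normalizing by drive days rather than drive count is what lets models with very different fleet sizes and deployment dates be compared on one scale.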

Notes and observations on the Q2 2024 Drive Stats

  • Upward AFR: The AFR for Q2 2024 was 1.71%. That's up from Q1 2024 at 1.41%, but down from one year ago (Q2 2023) at 2.28%. While the quarter over quarter increase was a bit surprising, quarterly fluctuations in AFR are expected. Sixteen drive models had an AFR of 1.71% or below while 13 drive models had an AFR above.
  • Two good zeroes: In Q2 2024, two drive models had zero failures, a 14TB Seagate (model: ST14000NM000J) and a 16TB Seagate (model: ST16000NM002J). Both have a relatively small number of drives and drive days for the quarter, so their success is somewhat muted, but the 16TB Seagate drive model has a very respectable 0.57% lifetime failure rate.
  • Another GOAT is gone: In Q1, we migrated the last of our 4TB Toshiba drives. In Q2, we migrated the last of our 6TB drives, including all of the Seagate 6TB drives which had reached an average age of nine years (108 months). This Seagate drive model closed out its career at Backblaze with an impressive 0.86% lifetime AFR.

    Currently the 4TB Seagate (model: ST4000DM000) is our oldest data drive model in production at an average age of 99.5 months. The data on these drives is scheduled to be migrated over the next quarter or two using CVT, our in-house drive migration system. They'll never reach nine years of service.
  • The 10-Year Club: With the 6TB Seagate drives being migrated at an average age of nine years, we wondered: What is the oldest data drive in service? The answer: a 4TB HGST drive (model: HMS5C4040ALE640) with 9 years, 11 months, and 23 days of service as of the end of Q2. Alas, the Backblaze Vault in which this drive resides is now being migrated, as are many other drives with over nine years of service. We'll check next quarter to see if any of them made it to the 10-Year Club before they are retired.

    While there are no data drives with 10 years of service, there are 11 HDD boot drives that exceed the mark. In fact one, a 500GB WD drive (model: WD5000BPKT) has over 11 years of service. (Psst, don't tell the CVT team.)
  • An HGST surprise: Over the years, the HGST drive models we have used performed very well. So, when the 12TB HGST (model: HUH721212ALN604) drive showed up with a 7.17% AFR for Q2, it's news. Such uncharacteristic quarterly failure rates for this model actually go back about a year, although the 7.17% AFR is the largest quarterly value to date. As a result, the lifetime AFR has risen from 0.99% to 1.57% over the last year. While the lifetime AFR is not alarming, we are paying attention to this trend.

Lifetime hard drive failure rates

As of the end of Q2 2024, we were tracking 284,876 operational hard drives. To be considered for the lifetime review, a drive model was required to have 500 or more drives as of the end of Q2 2024 and have over 100,000 accumulated drive days during their lifetime. When we removed those drive models which did not meet the lifetime criteria, we had 283,065 drives grouped into 25 models remaining for analysis as shown in the table below.

Age, AFR, and snakes

One of the truisms in our business is that different drive models fail at different rates. Our goal is to develop a failure profile for a given drive model over time. Such a profile can help optimize our drive replacement and migration strategies, and ultimately maintain the durability of our cloud storage service.

For our cohort of data drives, we'll look at the changes in the lifetime AFR over time for drive models with at least one million drive days as of the end of Q2 2024. This gives us 23 drive models to review. We'll divide the drive models into two groups: those whose average age is five years (60 months) or less, and those whose average age is above 60 months. Why that cutoff? That's the typical warranty period for enterprise class hard drives.

Let's start by plotting the current lifetime AFR for the 14 drive models that have an average age of 60 months or less as shown in the chart below.

Let's review the drive models by characterizing the four quadrants as follows:

  • Quadrant I: Drive models in this quadrant are performing well, and have a respectable AFR of less than 1.5%. Drive models to the right in this quadrant might require a little more attention over the coming months than those to the left.
  • Quadrant II: These drive models have failure rates above 1.5%, but are still reasonable at around 2% lifetime AFR. What is important is that AFR does not increase significantly over time.
  • Quadrant III: There are no drives currently in this quadrant, but if there were it would not be a cause for alarm. Why? Some drive models experience higher rates of failure early on, and then following the bathtub curve, their AFR drops as they get older.
  • Quadrant IV: These drive models are just starting out and are just beginning to establish their failure profile, which at the moment is good.
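
The quadrant descriptions above can be expressed as a small classifier. This is a sketch under stated assumptions: the 1.5% AFR boundary comes from the text, but the age boundary between the "just starting out" quadrants and the older ones is not given, so `age_split_months` is a hypothetical parameter the caller must supply. Treating values exactly on a boundary as the higher category is also an assumption.

```python
def quadrant(avg_age_months, lifetime_afr_pct, age_split_months, afr_split_pct=1.5):
    """Classify a drive model into the quadrants described in the text.

    afr_split_pct=1.5 comes from the text; age_split_months is NOT given in
    the text and is an assumption of this sketch.
    """
    older = avg_age_months >= age_split_months
    high_afr = lifetime_afr_pct >= afr_split_pct
    if older:
        return "II" if high_afr else "I"
    return "III" if high_afr else "IV"


# Made-up inputs for illustration, with an assumed 30-month age split.
print(quadrant(50, 0.9, age_split_months=30))  # older, low AFR
print(quadrant(10, 0.5, age_split_months=30))  # newer, low AFR
```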

At a glance, the chart tells us that everything seems fine. The drives in Quadrant I are performing well, the two drives in Quadrant II could be better, but are still acceptable, and there are no surprises in the newer drive models to this point. Let's see how things fare for the drive models which have an average age of over 60 months as in the chart below.

There are nine drive models which fit the average age criteria, including the Seagate 6TB drive (in yellow) whose drives were removed from service in Q2. As you can see the drive models are spread out across all four quadrants. As before, Quadrant I contains good drives, Quadrants II and III are drives we need to worry about, and Quadrant IV models look good so far.

If we were to stop here we could decide for example that the 4TB Seagate drives are first in line for the CVT migration process, but not so fast. All of these drive models have been around for at least five years and we have their failure rates over time. So, rather than rely on just a point in time, let's look at their change in failure rates over time in the chart below.

The snake chart, as we're calling it, shows the lifetime failure rate of each drive model over time. We started at 24 months to make the chart less messy. Regardless, the drive models sort themselves out into either Quadrant I or II once their average age passes 60 months. Let's take a look at the drives in each of those quadrants.

  • Quadrant I: Five of the nine drive models are in Quadrant I as of Q2 2024. The two 4TB HGST drives (brown and purple lines) as well as the 6TB Seagate (red line) have nearly vertical lines indicating their failure rates have been consistent over time, especially after 60 months of service. Such demonstrated consistency over time is a failure profile we like to see.

    The failure profiles of the 8TB Seagate (blue line) and the 8TB HGST (gray line) are less consistent, with each increasing their failure rates as they have aged. In the case of the HGST drive, the lifetime AFR rose from about 0.5% to 1.0% over an 18 month period starting at 48 months before leveling out. The Seagate drive took about two years starting at 60 months to go from 1.0% to nearly 1.5% before leveling out.
  • Quadrant II: The remaining 4 drive models ended in this quadrant. Three of the models, the 8TB Seagate (yellow line), the 10TB Seagate (green line), and the 12TB HGST (teal line) have similar failure profiles. All three got to some point in their lifetime and their curve began bending to the right. In other words, their failure rates over time accelerated. While the 8TB Seagate (yellow) shows some signs of leveling off, all three models will be closely watched and replaced if this trend continues.

    Also in Quadrant II is the 4TB Seagate drive (black line). This drive model is aggressively being migrated and is being replaced by 16TB and larger drives via the CVT process. As such, it is hard to tell if the nearly vertical failure profile is a function of the replacement process or the drive model failure rate leveling out over time. Either way, the migration of this drive model is expected to be complete in the next quarter or two.
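
One way to build a single "snake" for a drive model is to recompute the lifetime AFR at each reporting point from running totals. The sketch below assumes that lifetime AFR is cumulative failures over cumulative drive years (drive days / 365), as a percentage; the input tuple layout and the sample history are hypothetical and are not Drive Stats data.

```python
from itertools import accumulate


def lifetime_afr_series(quarters):
    """Return [(avg_age_months, lifetime AFR %)] computed from running totals.

    `quarters` is a chronological list of (avg_age_months, failures, drive_days)
    tuples; this layout is a hypothetical input format. Drive days are assumed
    to be positive in every period.
    """
    ages = [q[0] for q in quarters]
    cum_failures = accumulate(q[1] for q in quarters)
    cum_days = accumulate(q[2] for q in quarters)
    return [
        (age, failures / (days / 365) * 100)
        for age, failures, days in zip(ages, cum_failures, cum_days)
    ]


# Made-up history: steady failures for two periods, then a jump in the third.
history = [(24, 2, 73_000), (27, 2, 73_000), (30, 6, 73_000)]
snake = lifetime_afr_series(history)
for age, afr in snake:
    print(f"{age} months: {afr:.2f}%")
```

With AFR on the horizontal axis and age on the vertical axis, the first two points form a vertical segment (unchanged lifetime AFR), while the third bends to the right, which is the pattern described for the Quadrant II models.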

A normal failure profile

If we had to pick one of the drive models to represent a normal failure profile, it would be the 8TB Seagate (blue line, model: ST8000DM002). Why? The failure rate for the first 60 months was consistently around 1.0%, Seagate's predicted AFR. After 60 months, the AFR increased as the drive aged as one would expect. You might have thought we'd choose the failure profile of one of the two 4TB HGST drive models (brown and purple lines). The "trouble" is their failure rates are well below any published AFR by any drive manufacturer. While that's great for us, their annualized failure rates over time are sadly not normal.

Can AI help?

The idea of using AI/ML techniques to predict drive failure has been around for several years, but as a first step let's see if predicting drive failure is even an AI-worthy problem. We recently conducted a webinar "Leveraging Your Cloud Storage Data in AI/ML Apps and Services" in which we outlined general criteria to be used in evaluating if AI/ML is needed to solve a given problem, in this case predicting drive failure. The most salient criterion that applies here is that AI is best used for a problem for which you cannot consistently apply a set of rules to solve the problem.

A model is trained by taking the source data and applying an algorithm to iteratively combine and weigh multiple factors. The output is a model which can be used to answer questions about the model's subject matter, in this case drive failure. For example, we train a model using the Drive Stats data for a given drive model for the last year. Then, we ask the model a question using drive Z's daily SMART stats and related information. We use this data as input to the model, and while there is no exact match, the model will use inference to develop a response of the probability of drive failure for drive Z over time. As such, it would seem that drive failure prediction would be a good candidate for using AI.
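
To make "iteratively combine and weigh multiple factors" concrete, here is a toy logistic regression trained by gradient descent in plain Python. It is only an illustration of the general idea: the text does not name an algorithm, this is not a method Backblaze has said it uses, and the two scaled features and all training rows are made up rather than taken from the Drive Stats dataset.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def failure_probability(weights, bias, x):
    """Inference: weighted sum of the factors, squashed into a 0..1 probability."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def train(features, labels, epochs=2000, lr=0.1):
    """Fit logistic-regression weights by batch gradient descent."""
    n = len(features)
    weights = [0.0] * len(features[0])
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(features, labels):
            error = failure_probability(weights, bias, x) - y
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
            grad_b += error
        # Each pass nudges every weight a little: the "iterative weighing".
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


# Hypothetical, pre-scaled features per drive: [reallocated sectors, pending sectors].
# Label 1 means the drive failed. All rows are invented for illustration.
features = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.8, 0.6], [0.9, 0.9], [0.7, 1.0]]
labels = [0, 0, 0, 1, 1, 1]

weights, bias = train(features, labels)
healthy = failure_probability(weights, bias, [0.05, 0.05])
at_risk = failure_probability(weights, bias, [0.85, 0.8])
print(f"healthy={healthy:.2f} at_risk={at_risk:.2f}")
```

Neither query row appears in the training data, so the two probabilities come from inference rather than from an exact match, which mirrors the "drive Z" example in the text.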

What's not clear is whether what is learned about one drive model can be applied to another drive model. One look at the snake chart above illustrates the issue: the failure profile for each drive model is different, sometimes radically different. For example, do you think you could train a model on the 4TB Seagate drives (black line) and use it to predict drive failures for either of the 4TB HGST drive models (purple and brown lines)? The answer may be yes, but it certainly doesn't seem likely.

All that said, several research papers and studies have been published over the years attempting to determine whether or not AI/ML can be used to make drive failure predictions. We'll be doing a review of these publications in the next couple of months and hopefully shed some light on the ability to use AI to accurately make drive failure predictions in a timely manner.

The Hard Drive Stats data

It has now been over 11 years since we began recording, storing, and reporting the operational statistics of the hard drives and SSDs we use to store data in the Backblaze data storage cloud. We look at the telemetry data of the drives, including their SMART stats and other health related attributes. We do not read or otherwise examine the actual customer data stored.

Over the years, we have analyzed the data we have gathered and published our findings and insights from our analyses. For transparency, we also publish the data itself, known as the Drive Stats dataset. This dataset is open source and can be downloaded from our Drive Stats webpage.
