06/08/2024 | Press release | Distributed by Public on 06/08/2024 20:20
As of the end of Q2 2024, Backblaze was monitoring 288,665 hard drives (HDDs) and solid state drives (SSDs) in our cloud storage servers located in our data centers around the world. We removed from this analysis 3,789 boot drives, consisting of 2,923 SSDs and 866 hard drives. This leaves us with 284,876 hard drives under management to review for this report. We'll review the annualized failure rates (AFRs) for Q2 2024 and the lifetime AFRs of the qualifying drive models, and we'll also check out drive age versus failure rates over time. Along the way, we'll share our observations and insights on the data presented and, as always, we look forward to you doing the same in the comments section at the end of the post.
For our Q2 2024 quarterly analysis, we remove from consideration: drive models which did have at least 100 drives in service at the end of the quarter, drive models which did not accumulate 10,000 or more drive days during the quarter, and individual drives which exceeded their manufacturer's temperature specification during their lifetime. The removed pool totalled 490 drives, leaving us with 284,386 drives grouped into 29 drive models for our Q2 2024 analysis.
The table below lists the AFRs and related data for these drive models. The table is sorted large to small by drive size then by AFR within drive size.
As of the end of Q2 2024, we were tracking 284,876 operational hard drives. To be considered for the lifetime review, a drive model was required to have 500 or more drives as of the end of Q2 2024 and have over 100,000 accumulated drive days during their lifetime. When we removed those drive models which did not meet the lifetime criteria, we had 283,065 drives grouped into 25 models remaining for analysis as shown in the table below.
One of the truisms in our business is that different drive models fail at different rates. Our goal is to develop a failure profile for a given drive model over time. Such a profile can help optimize our drive replacement and migration strategies, and ultimately maintains the durability of our cloud storage service.
For our cohort of data drives, we'll look at the changes in the lifetime AFR over time for drive models with at least one million drive days as of the end of Q2 2024. This gives us 23 drive models to review. We'll divide the drive models into two groups: those whose average age is five years (60 months) or less, and those whose average age is above 60 months. Why that cutoff? That's the typical warranty period for enterprise class hard drives.
Let's start by plotting the current lifetime AFR for the 14 drives models that have an average age of 60 months or less as shown in the chart below.
Let's review the drive models by characterizing the four quadrants as follows:
At a glance, the chart tells us that everything seems fine. The drives in Quadrant I are performing well, the two drives in Quadrant II could be better, but are still acceptable, and there are no surprises in the newer drive models to this point. Let's see how things fair for the drive models which have an average age of over 60 months as in the chart below.
There are nine drive models which fit the average age criteria, including the Seagate 6TB drive (in yellow) whose drives were removed from service in Q2. As you can see the drive models are spread out across all four quadrants. As before, Quadrant I contains good drives, Quadrants II and III are drives we need to worry about, and Quadrant IV models look good so far.
If we were to stop here we could decide for example that the 4TB Seagate drives are first in line for the CVT migration process, but not so fast. All of these drive models have been around for at least five years and we have their failure rates over time. So, rather than rely on just a point in time, let's look at their change in failure rates over time in the chart below.
The snake chart, as we're calling it, shows the lifetime failure rate of each drive model over time. We started at 24 months to make the chart less messy. Regardless, the drive models sort themselves out into either Quadrant I or II once their average age passes 60 months. Let's take a look at the drives in each of those quadrants.
If we had to pick one of the drive models to represent a normal failure profile, it would be the 8TB Seagate (blue line, model: ST800DM002). Why? The failure rate for the first 60 months was consistently around 1.0%, Seagate's predicted AFR. After 60 months, the AFR increased as the drive aged as one would expect. You might have thought we'd choose the failure profile of one of the two 4TB HGST drive models (brown and purple lines). The "trouble" is their failure rates are well below any published AFR by any drive manufacturer. While that's great for us, their annualized failure rates over time are sadly not normal.
The idea of using AI/ML techniques to predict drive failure has been around for several years, but as a first step let's see if predicting drive failure is even an AI-worthy problem. We recently conducted a webinar "Leveraging Your Cloud Storage Data in AL/ML Apps and Services" in which we outlined general criteria to be used in evaluating if AI/ML is needed to solve a given problem, in this case predicting drive failure. The most salient criteria which applies here is that AI is best used for a problem for which you can not consistently apply a set of rules to solve the problem.
A model is trained by taking the source data and applying an algorithm to iteratively combine and weigh multiple factors. The output is a model which can be used to answer questions about the model's subject matter, in this case drive failure. For example, we train a model using the Drive Stats data for a given drive model for the last year. Then, we ask the model a question using drive Z's daily SMART stats and related information. We use this data as input to the model, and while there is no exact match, the model will use inference to develop a response of the probability of drive failure for drive Z over time. As such, it would seem that drive failure prediction would be a good candidate for using AI.
What's not clear is whether what is learned about one drive model can be applied to another drive model. One look at the snake chart above visualizes the issue as the failure profile for each drive model is different, sometimes radically different. For example, do you think you could train a model on the 4TB Seagate drives (black line) and use it to predict drive failures for either of the 4TB HGST drive models (purple and brown lines)? The answer may be yes, but it certainly doesn't seem likely.
All that said, several research papers and studies have been published over the years attempting to determine whether or not AI/ML can be used to make drive failure predictions. We'll be doing a review of these publications in the next couple of months and hopefully shed some light on the ability to use AI to accurately make drive failure predictions in a timely manner.
It has now been over 11 years since we began recording, storing, and reporting the operational statistics of the hard drives and SSDs we use to store data in the Backblaze data storage cloud. We look at the telemetry data of the drives, including their SMART stats and other health related attributes. We do not read or otherwise examine the actual customer data stored.
Over the years, we have analyzed the data we have gathered and published our findings and insights from our analyses. For transparency, we also publish the data itself, known as the Drive Stats dataset. This dataset is open source and can be downloaded from our Drive Stats webpage.