PagerDuty Inc.

09/10/2024 | News release

Myth vs. Reality: Lessons in Reliability from the July 19 Outage

It was 3AM at Newark Liberty International Airport. I was groggy, waiting in line to get my boarding pass, only to be met with a blue screen on the check-in kiosk. When I went looking for coffee, I learned the vendor was only accepting cash.

There was clearly a big outage, so I quickly checked our systems at PagerDuty. Major outages happen multiple times per year - frequently enough that we have an internal dashboard for them (colloquially referred to as "the internets are broken"). Sure enough, the chart indicated I was not the only one having a bad customer experience that day. A quick check of the PagerDuty app on my phone, though, confirmed that our own systems were fine. Relieved, I went back to drinking my coffee and catching up on some reading.

Then, as fate would have it, the person sitting next to me got paged. Rather than panic, however, that familiar PagerDuty notification sound reassured me that our platform was doing exactly what it was designed to do: keep critical services running when everything else was falling apart.

This moment highlighted a key leadership lesson. As a leader, you need a system of practices - one you can trust to work even when you are unreachable, stuck in an airport with no coffee - to create a reliable experience for your customers. When the stakes are high, what truly matters is not just having a system in place but having a system of systems: an integrated framework of technologies, processes, and protocols that ensures resilience in the face of potential failure.

There is a common misconception about reliability, one that often surfaces in conversations with customers and peers: the idea that reliability is a singular trait, or that it can be achieved through simple redundancy. While some might think that just having a backup server or a redundant cloud equates to reliability, the reality is much more complex.

Myth #1: Redundancy Equals Reliability

One of the most persistent myths about reliability is that if you have two of everything, you're covered. In this oversimplified view, having two cloud providers, two servers, or two databases should make your system invincible. However, this mindset ignores the complexities that arise when you introduce multiple systems that must work in tandem.

Adding more components doesn't necessarily make your system more reliable. In fact, it can introduce new points of failure. When two systems have dependencies between them, the probability of failure does not halve; it increases. Add in the interconnectedness of modern systems, and complexity alone can become the biggest reliability risk.
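
A quick back-of-the-envelope calculation makes this concrete. The numbers below are purely illustrative (not PagerDuty figures): two components that both must be up compound their failure probabilities, while a redundant copy only helps if the failures are truly independent and the failover path itself works.

    # Back-of-the-envelope availability math; the 99.9% figure is illustrative,
    # not a PagerDuty number.
    a = 0.999  # availability of a single component

    # Two components with a hard dependency: both must be up.
    serial = a * a                  # ~0.998001 -> failure probability roughly doubles

    # Two redundant, truly independent components: only one must be up.
    parallel = 1 - (1 - a) ** 2     # ~0.999999 -> only if failures are independent
                                    #    and the failover mechanism itself never fails

    print(f"dependent pair: {serial:.6f}, redundant pair: {parallel:.6f}")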

Myth #2: Preventing Failure is the Only Goal

Another common misconception is that the key to reliability is preventing failure altogether. If you can prevent failures, then you won't need to worry about downtime, right?

Unfortunately, no system, regardless of how well designed, is immune to failure. Rather than designing systems with the unrealistic expectation that they will never fail, we design systems under the principle of "assume failure" and handle failure gracefully. This approach involves implementing safeguards like failure masking, where failing components are isolated and do not affect the overall system, and bounded impact, which limits the effect of failures to small, manageable areas of the infrastructure.

A great example of this approach in action is our practice of "Failure Fridays." We routinely test failure modes not just in staging but in production. We simulate a variety of failure scenarios - everything from server crashes to network failures - to ensure that our systems are resilient under real-world conditions. It's a practice that allowed us to confidently handle the July 19 spike in traffic without major outages.
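
As a rough sketch of what such a drill can look like in code, the snippet below injects one failure mode and keeps checking a user-facing health endpoint until the drill ends; the inject_fault/clear_fault hooks and the health URL are placeholders, not PagerDuty's actual tooling.

    # A minimal "Failure Friday"-style drill: inject one fault, verify the
    # user-facing health check stays green, and always clean up afterwards.
    import time
    import urllib.request

    HEALTH_URL = "https://status.example.com/health"   # placeholder endpoint

    def healthy(timeout=2):
        try:
            return urllib.request.urlopen(HEALTH_URL, timeout=timeout).status == 200
        except OSError:
            return False

    def run_drill(inject_fault, clear_fault, duration_s=300):
        assert healthy(), "don't start a drill while the system is already degraded"
        inject_fault()                      # e.g. stop one replica, drop one network link
        try:
            deadline = time.time() + duration_s
            while time.time() < deadline:
                if not healthy():
                    raise RuntimeError("fault was not masked - abort and page the team")
                time.sleep(5)
        finally:
            clear_fault()                   # always undo the injected fault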

Myth #3: More Responders Equals Faster Resolution

In times of crisis, many organizations believe that the more people they bring into the incident response process, the faster the resolution will be. It's a logical assumption: more brains should equal faster problem-solving.

But the reality is often the opposite. Adding too many responders to an incident can lead to confusion, duplicate work, and bottlenecks in communication. At PagerDuty, we've learned that automation is often a more effective way to improve resolution times than simply adding more human responders.

During the July outage, one of our customers shared an anecdote with us that perfectly illustrated this approach. His team had just started implementing AIOps when the outage hit. Before AIOps, they dealt with a manageable amount of system noise, but during the outage, that noise ballooned into an overwhelming wave of alerts. However, because they were automating alert management, they were able to quickly cut through the noise and identify the root cause of the issues, giving their human responders the space to focus on high-priority incidents and ensure that critical systems were restored.
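
As an illustration of the idea (not of how PagerDuty AIOps groups events internally), even a simple time-window grouping rule can collapse a flood of related alerts into a handful of incident candidates for human responders:

    # A minimal sketch of time-window alert grouping; illustrative only.
    from collections import defaultdict
    from datetime import datetime

    def group_alerts(alerts, window_s=300):
        """Collapse alerts from the same service that arrive within the same
        5-minute window into one incident candidate."""
        groups = defaultdict(list)
        for alert in alerts:
            bucket = int(alert["ts"].timestamp() // window_s)
            groups[(alert["service"], bucket)].append(alert)
        return list(groups.values())

    noise = [{"service": "checkout", "ts": datetime(2024, 7, 19, 9, 0, s)}
             for s in range(0, 120, 2)]
    print(len(noise), "alerts ->", len(group_alerts(noise)), "incident candidate(s)")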

Resilience is the Sum of Many Parts

Ultimately, reliability is not a one-time achievement - it's an ongoing process. This is why we've invested in automation, AI-driven insights, and continuous testing.

Our commitment and work paid off. Our platform delivered more than 60,000 notifications per minute during the peak of the outage, all while staying within our 15-second average notification time. We maintained our internal SLOs, and customers resolved incidents just 29% slower than on a normal day despite a 192% spike in incident volume. And for responders who were configured for mobile and Slack, notifications happened in under 500 milliseconds - literally in the blink of an eye.

What can you do instead in your own system design choices? We try to follow a layered, defense-in-depth approach to reliability: even assuming you have done many other things in both your process and your system design to avoid failure, things still fail. If you accept that a certain amount of failure is inevitable, you can adopt other strategies to create resilience. For instance, you can mask the failure. Examples of this strategy in action are cluster orchestration, automated failovers, state replication, and leader election. At PagerDuty, many of our systems employ these strategies, and we further test and validate them in production with regular "failure any day" tests to ensure they work as designed.
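
As a minimal sketch of the masking idea (the replica URLs and function names are placeholders, not PagerDuty internals), an automated failover can hide a single node's failure from callers as long as one healthy replica remains:

    # Failure masking via automated failover across replicas (illustrative sketch).
    import urllib.request

    REPLICAS = [
        "https://primary.example.com/api/status",
        "https://secondary.example.com/api/status",
    ]

    def fetch_status(timeout=2):
        """Try each replica in order; a single node failure is masked from the caller."""
        last_err = None
        for url in REPLICAS:
            try:
                with urllib.request.urlopen(url, timeout=timeout) as resp:
                    return resp.read()
            except OSError as err:          # node down, timeout, DNS failure, ...
                last_err = err              # mask it and move on to the next replica
        raise RuntimeError("all replicas failed") from last_err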

Masking failure isn't always possible, though, and it doesn't catch every case. The next thing you can do is bound the failure. The best and easiest way to do this is with canaries and phased rollouts. We do phased rollouts for all changes, whether infrastructure or feature releases, gradually increasing the traffic they receive.
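
A phased rollout can be as simple as the loop below; the stage percentages, error budget, and the set_traffic_split/get_error_rate/rollback hooks are assumptions for illustration, not our actual release tooling:

    # A minimal phased-rollout loop: ramp traffic in stages, and bound the blast
    # radius by rolling back if the canary's error rate degrades. Hooks are assumed.
    import time

    STAGES = [1, 5, 25, 50, 100]     # percent of traffic on the new version
    ERROR_BUDGET = 0.01              # halt if the canary error rate exceeds 1%

    def phased_rollout(set_traffic_split, get_error_rate, rollback, soak_s=600):
        for pct in STAGES:
            set_traffic_split(new_version_pct=pct)
            time.sleep(soak_s)                  # let each stage soak before judging it
            if get_error_rate() > ERROR_BUDGET:
                rollback()                      # only pct% of traffic ever saw the failure
                raise RuntimeError(f"rollout halted at {pct}% traffic")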

If you can't avoid, mask, or bound failure, the next goal is to fix it fast. This is where a good, responsive incident response process comes into play: the faster you can recover your systems, the faster you can restore services for your customers and your mission. Fast rollbacks, which let you quickly revert changes during an incident, are a great companion to canaries. Case in point: every team at PagerDuty is required to have a rollback mechanism that can be executed in under five minutes. These types of automation help us respond to incidents faster and keep our customers' critical systems up and running. I often share this as a measurable, easy-to-action first step any organization can take to improve reliability and responsiveness with our automation tools: have every team implement change events, canaries, and fast rollbacks, and empower your incident commanders to execute a team's rollback script when an incident looks like it was caused by a recent change (as correlated through change events). Internally, these tools prevent at least a half dozen incidents a year - failures we can catch, bound, roll back, and restore before they become major incidents.
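
As one hedged illustration of that first step (the change-event shape and the rollback command are hypothetical, not a PagerDuty API), correlating an incident's start time with recent change events tells an incident commander which team's rollback script to run:

    # Sketch: find change events deployed shortly before an incident and run the
    # owning team's fast-rollback script. Event fields and commands are hypothetical.
    import subprocess
    from datetime import datetime, timedelta

    def suspect_changes(change_events, incident_start, lookback=timedelta(minutes=30)):
        """Return changes deployed in the lookback window before the incident, newest first."""
        window_start = incident_start - lookback
        recent = [c for c in change_events
                  if window_start <= c["deployed_at"] <= incident_start]
        return sorted(recent, key=lambda c: c["deployed_at"], reverse=True)

    def run_rollback(change):
        """Every team's rollback script is expected to finish in under five minutes."""
        subprocess.run(change["rollback_cmd"], shell=True, check=True, timeout=300)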

The lessons learned from the July 19 outage reaffirm that true reliability is not about any single component or quick fix. It's about building a resilient system of systems, one that assumes failure and is designed to recover from it quickly. We're committed to helping our customers achieve this level of resilience, and we believe that with the right strategies, tools, and mindset, any organization can achieve true operational reliability - no matter what challenges the future holds.

Want to learn more about preparing for outages? Check out this webinar: Learn from Incidents to Stay Prepared for the Next Outage.