PagerDuty Inc.

08/08/2024 | News release | Distributed by Public on 08/08/2024 21:42

Managing Vendor Incidents: Customer Impact That Isn’t Your Fault

One of the first key tenets of cloud computing was that "you own your own availability", the idea being that the public cloud providers were making infrastructure available to you, and your organization had to decide what to use and how to use it in order to meet your organization's goals. The cloud providers have no knowledge of your applications or their KPIs.

Over the last 10 years or so, organizations have become increasingly reliant on cloud computing facilities and other SaaS providers for many core functions of their technical stack. That's been great! Teams get to focus on the core business features that create value and bring in revenue without worrying about many of the more mundane requirements of their tech stack.

This dependence has brought risk. Cloud providers have experienced outages due to configuration errors, distributed denial of service (DDoS) attacks, and even catastrophic fires.

How should a team handle an incident that lies with an upstream provider? What can we bring from our experiences handling our own incidents?

We won't be able to fix these types of incidents on our own. Many teams will have to sit and wait out the problem. Others will weigh the cost of a migration or failover, and some will have already done so by the time the rest of us notice there's an issue.

Who Owns the Vendor Relationship During an Incident?

Managing vendor relationships often falls to a procurement, finance, or legal team. So much of vendor management is about contracts, payment terms, and SLAs. During a vendor incident, though, the teams integrating directly with the vendor's products need to be in the loop for vendor communications.

If your cloud infrastructure vendor is experiencing an outage, maybe your SRE team will be on top of notifications and status updates; if your billing vendor is involved, it will probably be the team that manages your payment processing flow. Developer Tools or Developer Experience teams may be on the lookout for problems with version control systems, build and deploy, or monitoring systems.

Knowing in advance which teams are responsible for which vendor relationships matters. It lets you verify whether your organization is impacted by a vendor incident, know when the incident has been fully mitigated and service completely restored, and determine what impact the incident had on your users.

Keep this information handy and make sure it is up to date as part of your incident preparedness. In PagerDuty, you can even define a service representing a vendor and add contact information, runbooks, and other data to the service definition to help your response, as well as an escalation policy that notifies the team that interfaces with the vendor.

Get Your Info from the Source

For large incidents and major outages, the events are often the main tech news story of the day. Information will be in the mainstream media, on social media, and on specialized mailing lists dedicated to particular products, or just outages in general.

For your primary vendors - services that sit in your productivity or revenue-generating paths - know if they host a status page and where it lives. Best practice suggests that these status pages be hosted off their main domain names, so you might not find them at company.com/status. They might also have dedicated social media accounts devoted to service status updates.

If they don't have a status page, they might have a customer notification email list that you'll need to subscribe to.

Your organization's chat platform also probably allows your team to integrate with your vendor status pages, providing another avenue for teammates to determine whether a vendor is having an incident.

Additionally, there are now a number of third-party reporting platforms that provide additional information:

  • Downdetector, Down for Everyone or Just Me, and others - track outages for large commercial sites as well as mobile providers. These are super user-friendly and helpful for folks who aren't sure if the problem they're seeing is just on their end or more widespread.
  • The Internet Weather Map reports on network lag globally. Helpful if your customers are worldwide. More for the network nerds.
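If a vendor's status page publishes a machine-readable feed, a small check can turn it into a yes/no signal for your team. The sketch below assumes a Statuspage-style status.json payload with a status.indicator field (commonly none, minor, major, or critical); your vendor's feed may differ, so treat the field names, the SEVERITY table, and the function name as illustrative assumptions. It parses a sample payload rather than fetching a real URL.

```python
import json

# Assumed indicator values, modeled on Statuspage-style status.json feeds.
SEVERITY = {"none": 0, "minor": 1, "major": 2, "critical": 3}


def vendor_is_degraded(status_json: str, threshold: str = "minor") -> bool:
    """Return True if the feed reports an indicator at or above `threshold`."""
    payload = json.loads(status_json)
    indicator = payload.get("status", {}).get("indicator", "none")
    # Unknown indicator values are treated as degraded so a human takes a look.
    level = SEVERITY.get(indicator, SEVERITY["critical"])
    return level >= SEVERITY[threshold]


# Hypothetical sample payload standing in for a downloaded feed.
sample = '{"status": {"indicator": "major", "description": "Partial System Outage"}}'
print(vendor_is_degraded(sample))
```

Raising the threshold to "major" would keep routine minor incidents from alerting anyone, which may suit vendors that report small subsystem issues often.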

Your Vendor Runbook

When a vendor has an incident, as a customer, you'll want some information at hand. Establish a runbook for your key vendors so you'll know who to contact and how.

Note key information in your runbook:

  • Your organization's account numbers or IDs so they can be referenced when contacting support.
  • Email addresses or contact information for your account managers and the vendor's support team.
  • Contract information such as packages and features you've purchased, as well as the level of support you have, if applicable. If you have an elevated support package, you want to be aware of that; it may include special contact points.
  • Status of your account and renewal date. Make sure your account isn't expired before reporting an issue.
  • Any vendor-specific reporting requirements, like error codes or stack traces that might be helpful to gather.

Also note in your vendor runbook when it is worth contacting the vendor at all. During large outages that impact hundreds or even thousands of customers, you might not need or want to contact the vendor and can rely on the public status information instead. For incidents that show no signs of wider impact, your teams will want to reach out.
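One way to keep a runbook entry consistent across vendors is to store it as structured data. The following is a minimal sketch, not a PagerDuty feature: the VendorRunbook class, its field names, and the sample values are all hypothetical, and it assumes the renewal date doubles as the date the account lapses.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class VendorRunbook:
    """Hypothetical record of the key facts listed above for one vendor."""
    vendor: str
    account_id: str
    support_contacts: list[str]
    support_tier: str
    renewal_date: date
    reporting_requirements: list[str] = field(default_factory=list)

    def account_is_current(self, today: date) -> bool:
        # Assumption: the account lapses the day after the renewal date.
        return today <= self.renewal_date

    def should_contact_vendor(self, widespread_outage: bool) -> bool:
        # During a large public outage, rely on public status information;
        # reach out when there is no sign of wider impact.
        return not widespread_outage


# Placeholder values for illustration only.
runbook = VendorRunbook(
    vendor="Example Billing Co",
    account_id="ACCT-0000",
    support_contacts=["support@example.com"],
    support_tier="elevated",
    renewal_date=date(2025, 1, 31),
    reporting_requirements=["error codes", "stack traces"],
)
print(runbook.account_is_current(date(2024, 8, 8)))
print(runbook.should_contact_vendor(widespread_outage=False))
```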

While You Wait

Public incidents can be super interesting to folks in your organization. They are dramatic! They're in the news! Everyone is distracted!

Incidents can be a huge waste of time across your organization for those reasons. If people feel like they can't get work done because a vendor is having an incident, your team needs a communications plan to keep people informed.

Your Major Incident workflows can help you keep distractions to a minimum, even when your team isn't actively managing a remediation.

  • Establish the internal point of contact. Designate someone from the team that owns the relationship to stay in touch with the vendor or to monitor the vendor's status. Pass this responsibility off after a few hours if the incident persists.
  • Establish how information will be shared. Use your existing stakeholder communications channels, so your team isn't searching for information somewhere unexpected.
  • If a vendor incident has impacts on your customers, liaise with your support teams for customer notifications and your own status updates.

Many vendor incidents are resolved in a relatively timely manner. Large, complex systems like AWS, Azure, and even GitHub have smaller incidents around some subsystems fairly regularly. These are easy enough to wait out, though they may impact your productivity. Some things to consider for these incidents:

  • Decide when or if your team should call a deploy freeze, and who will have the authority to make that decision, including executive-level support.
  • Determine where internal communication will happen. Make sure everyone knows what is happening.
  • Designate a team member to monitor the vendor status and give the all clear.

For larger, more widespread, or longer-running incidents, your disaster recovery (DR) plan may be needed. Hopefully you've practiced it recently!

You likely won't have full coverage for a DR plan. It's rare to have full redundancy of all your providers, at least in the short term. The ability to switch version control system providers or build and deploy providers, even during longer outages, is hard and expensive.

Infrastructure and data DR plans are more common, and what many folks have in mind when they are owning their own availability. Your DR plan may include any number of features, but some basics to keep in mind include:

  • Know when to declare a disaster and initiate a failover. Establish thresholds for customer impact, revenue impact, and other key metrics.
  • Establish executive responsibility and communications.
  • Initiate a Major Incident, or DR Incident if you have one, so all teams are on alert.
  • Have predetermined success and QA tests ready to go.
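Writing the failover thresholds down as data, ahead of time, makes the "declare a disaster" decision less of a judgment call under stress. The sketch below is one possible shape for that check; the threshold names, units, and numbers are hypothetical and would come from your own DR plan.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DisasterThresholds:
    """Hypothetical limits agreed on before any incident happens."""
    customers_impacted_pct: float
    revenue_loss_per_hour: float
    outage_minutes: int


def breached_thresholds(limits: DisasterThresholds,
                        customers_impacted_pct: float,
                        revenue_loss_per_hour: float,
                        outage_minutes: int) -> list[str]:
    """Return the names of the thresholds that current metrics meet or exceed."""
    breached = []
    if customers_impacted_pct >= limits.customers_impacted_pct:
        breached.append("customers_impacted_pct")
    if revenue_loss_per_hour >= limits.revenue_loss_per_hour:
        breached.append("revenue_loss_per_hour")
    if outage_minutes >= limits.outage_minutes:
        breached.append("outage_minutes")
    return breached


# Illustrative numbers only.
limits = DisasterThresholds(customers_impacted_pct=25.0,
                            revenue_loss_per_hour=10_000.0,
                            outage_minutes=120)
reasons = breached_thresholds(limits, customers_impacted_pct=40.0,
                              revenue_loss_per_hour=2_500.0,
                              outage_minutes=45)
print("declare disaster" if reasons else "keep waiting", reasons)
```

Returning the list of breached thresholds, rather than a bare yes or no, gives the executive who owns the decision the reason along with the recommendation.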

Your Post-Vendor-Incident Review

After a significant vendor incident, your team will be in a place to decide if the vendor has lost your trust as a customer. At this point your folks in procurement, finance, or legal should be involved to determine if SLAs were violated and your company is owed a credit or refund from the vendor.
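A rough availability calculation can help frame that SLA conversation. This sketch assumes a simple uptime-percentage SLA measured over a calendar period; real contracts often define exclusions and measurement windows differently, so the 99.9% target and the downtime figures here are illustrative only.

```python
def availability_pct(period_minutes: int, downtime_minutes: float) -> float:
    """Percentage of the period during which the service was available."""
    return (period_minutes - downtime_minutes) / period_minutes * 100


def sla_violated(period_minutes: int, downtime_minutes: float,
                 target_pct: float) -> bool:
    """True if measured availability fell below the contracted target."""
    return availability_pct(period_minutes, downtime_minutes) < target_pct


# A 30-day month has 30 * 24 * 60 = 43,200 minutes.
MONTH_MINUTES = 30 * 24 * 60

# Hypothetical incident: 90 minutes of downtime against a 99.9% target.
print(round(availability_pct(MONTH_MINUTES, 90), 3))
print(sla_violated(MONTH_MINUTES, 90, target_pct=99.9))
```

At a 99.9% target, a 30-day month allows about 43 minutes of downtime, so a single 90-minute outage would fall short under these assumptions.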

The teams utilizing the vendor should evaluate whether the incident was impactful enough to trigger a vendor change. Weighing the cost of incident(s) against the switching costs and available features should be handled after the incident is concluded, when the team can fully evaluate how the vendor handled the incident from start to finish.

As with any PIR, determine if your actions were effective and make any updates needed to your vendor runbook:

  • Was all of your information up to date?
  • Were the vendor's communications to you, and your own internal communications to your teams, effective?
  • Were you able to recover functionality when the vendor claimed service to be restored, or were there other actions required?
  • Was there anything else that slowed down your notice of the incident or recovery afterwards?

Conclusion

Vendor incidents are stressful, not only because of their potential impact on our organizations, but often because of the feeling of helplessness our responders feel when issues are out of their hands. Preparing in advance for vendor issues will help keep your teams informed and make recovery more efficient.
