08/08/2024 | News release | Distributed by Public on 08/08/2024 21:42
One of the first key tenets of cloud computing was that "you own your own availability", the idea being that the public cloud providers were making infrastructure available to you, and your organization had to decide what to use and how to use it in order to meet your organization's goals. The cloud providers have no knowledge of your applications or their KPIs.
Over the last 10 years or so, organizations have become increasingly reliant on cloud computing facilities and other SaaS providers for many core functions of their technical stack. That's been great! Teams get to focus on the core business features that create value and bring in revenue, without worrying about many of the more mundane requirements of their tech stack.
This dependence has brought risk. Cloud providers have experienced outages due to configuration errors, distributed denial-of-service (DDoS) attacks, and even catastrophic fires.
How should a team handle an incident that lies with an upstream provider? What can we bring from our experiences handling our own incidents?
We won't be able to fix these types of incidents on our own. Many teams will have to sit and wait out the problem. Others will weigh the cost of a migration or failover, and some will have already done so by the time the rest of us notice there's an issue.
Who Owns the Vendor Relationship During an Incident?
Managing vendor relationships often falls to a procurement, finance, or legal team. So much of vendor management is about contracts, payment terms, and SLAs. During a vendor incident, though, the teams integrating directly with the vendor's products need to be in the loop for vendor communications.
If your cloud infrastructure vendor is experiencing an outage, maybe your SRE team will be on top of notifications and status updates; if your billing vendor is involved, it will probably be the team that manages your payment processing flow. Developer Tools or Developer Experience teams may be on the lookout for problems with version control systems, build and deploy, or monitoring systems.
Knowing in advance which teams are responsible for which vendor relationships matters for three things: verifying whether your organization is impacted by a vendor incident, knowing when the incident has been fully mitigated and service completely restored, and determining what impact the incident had on your users.
Keep this information handy and make sure it is up to date as part of your incident preparedness. In PagerDuty, you can even define a service representing a vendor and add contact information, runbooks, and other data to the service definition to help your response, as well as an escalation policy that notifies the team that interfaces with the vendor.
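As a rough illustration of what "knowing in advance" can look like, here is a minimal sketch of a vendor ownership registry. It is not PagerDuty's data model or API; the class, the field names, the team names, and the policy names are all hypothetical placeholders you would swap for your own.

```python
from dataclasses import dataclass, field


@dataclass
class VendorOwnership:
    """Who inside the organization owns a vendor relationship during an incident."""
    vendor: str
    owning_team: str        # team that integrates with the vendor day to day
    escalation_policy: str  # hypothetical name of the on-call policy to notify
    contract_contacts: list[str] = field(default_factory=list)  # e.g. procurement, finance, legal


# Hypothetical registry: every vendor, team, and policy name is a placeholder.
REGISTRY = {
    entry.vendor: entry
    for entry in [
        VendorOwnership("cloud-infrastructure", "sre", "sre-primary", ["procurement"]),
        VendorOwnership("billing", "payments", "payments-primary", ["finance", "legal"]),
        VendorOwnership("version-control", "developer-experience", "devex-primary"),
    ]
}


def policy_to_notify(vendor: str) -> str:
    """Return the escalation policy for a vendor, or a fallback when nobody owns it."""
    entry = REGISTRY.get(vendor)
    if entry is None:
        # An unowned vendor is itself a preparedness gap worth surfacing.
        return "unowned-vendor-triage"
    return entry.escalation_policy
```

Returning a fallback policy for vendors missing from the registry is a design choice: a vendor with no owner still gets a human's attention instead of failing silently.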
Get Your Info from the Source
For large incidents and major outages, the events are often the main tech news story of the day. Information will be in the mainstream media, on social media, and on specialized mailing lists dedicated to particular products, or just outages in general.
For your primary vendors - services that sit in your productivity or revenue-generating paths - know if they host a status page and where it lives. Best practice suggests that these status pages be hosted off their main domain names, so you might not find them at company.com/status. They might also have dedicated social media accounts devoted to service status updates.
If they don't have a status page, they might have a customer notification email list that you'll need to subscribe to.
Your organization's chat platform also probably allows your team to integrate with your vendor status pages, providing another avenue for teammates to determine if an incident is happening on the vendor's side.
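If a vendor's status page publishes a machine-readable summary, your tooling can read it as well as your teammates can. The sketch below assumes a JSON document with a `status.indicator` field whose values are `none`, `minor`, `major`, or `critical`, in the style of Statuspage-hosted pages. Your vendor's format may differ, so confirm it before relying on this. The function names and the sample payload are illustrative only.

```python
import json

# Severity order assumed for a Statuspage-style "indicator" field.
SEVERITY = {"none": 0, "minor": 1, "major": 2, "critical": 3}


def parse_status(payload: str) -> tuple[str, str]:
    """Extract (indicator, description) from a status summary document."""
    status = json.loads(payload).get("status", {})
    return status.get("indicator", "unknown"), status.get("description", "")


def vendor_is_degraded(payload: str) -> bool:
    """Treat anything other than a clean 'none' indicator as worth a look."""
    indicator, _ = parse_status(payload)
    # Unrecognized indicators count as degraded so a format change gets noticed.
    return SEVERITY.get(indicator, 1) > 0


# Illustrative payload, not captured from any real vendor.
SAMPLE = '{"status": {"indicator": "major", "description": "Partial System Outage"}}'
```

Fetching the document is left out on purpose: where it lives and how often you may poll it depend on the vendor.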
Additionally, there are now a number of third-party reporting platforms that provide additional information:
Your Vendor Runbook
When a vendor has an incident, as a customer, you'll want some information at hand. Establish a runbook for your key vendors so you'll know who to contact and how.
Note key information in your runbook:
Also note in your vendor runbook when it makes sense to contact the vendor at all. During large outages that impact hundreds or even thousands of customers, you might not need or want to contact the vendor, and can rely on the public status information instead. For incidents that show no indication of wider impact, your teams will want to reach out.
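The contact-or-wait guidance above can be written down so responders don't have to reason it out mid-incident. This is a minimal sketch under stated assumptions: the `VendorRunbook` fields, the `next_step` function, and the example addresses are all hypothetical, and your own runbook will likely carry more detail.

```python
from dataclasses import dataclass


@dataclass
class VendorRunbook:
    vendor: str
    status_page: str                   # where public updates are posted
    support_contact: str               # how to open a ticket or reach an account rep
    contact_when_public: bool = False  # contact even if the outage is already public?


def next_step(runbook: VendorRunbook, publicly_acknowledged: bool, we_are_impacted: bool) -> str:
    """Suggest what responders should do for a suspected vendor incident."""
    if not we_are_impacted:
        return "monitor"
    if publicly_acknowledged and not runbook.contact_when_public:
        # Large outage: the vendor already knows, so follow public updates.
        return f"follow {runbook.status_page}"
    # No sign of wider impact: the vendor may not know yet, so reach out.
    return f"contact {runbook.support_contact}"


# Placeholder values for illustration only.
billing = VendorRunbook("billing", "status.example.com", "support@example.com")
```

With these placeholder values, a publicly acknowledged outage that impacts you yields "follow status.example.com", while an unacknowledged one yields "contact support@example.com".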
While You Wait
Public incidents can be super interesting to folks in your organization. They are dramatic! They're in the news! Everyone is distracted!
Incidents can be a huge waste of time across your organization for those reasons. If people feel like they can't get work done because a vendor is having an incident, your team needs a communications plan to keep people informed.
Your Major Incident workflows can help you keep distractions to a minimum, even when your team isn't actively managing a remediation.
Many vendor incidents are resolved in a relatively timely manner. Large, complex systems like AWS, Azure, and even GitHub have smaller incidents around some subsystems fairly regularly. These are easy enough to wait out, though they may impact your productivity. Some things to consider for these incidents:
For larger, more widespread, or longer-running incidents, your disaster recovery (DR) plan may be needed. Hopefully you've practiced it recently!
You likely won't have full coverage for a DR plan. It's rare to have full redundancy of all your providers, at least in the short term. Switching version control system providers or build and deploy providers, even during longer outages, is hard and expensive.
Infrastructure and data DR plans are more common, and are what many folks have in mind when they think about owning their own availability. Your DR plan may include any number of features, but some basics to keep in mind include:
Your Post-Vendor-Incident Review
After a significant vendor incident, your team will be in a place to decide if the vendor has lost your trust as a customer. At this point your folks in procurement, finance, or legal should be involved to determine if SLAs were violated and your company is owed a credit or refund from the vendor.
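Before that conversation, it helps to have the arithmetic ready. The sketch below turns measured downtime into an availability percentage and compares it with a contracted target. It assumes a simple 30-day measurement window and counts every minute of downtime. Real SLAs define their own windows, exclusions, and credit terms, so treat the function names and numbers as illustrative.

```python
def availability_percent(downtime_minutes: float, period_days: int = 30) -> float:
    """Availability over the period, as a percentage."""
    total_minutes = period_days * 24 * 60
    return 100.0 * (total_minutes - downtime_minutes) / total_minutes


def allowed_downtime_minutes(sla_target_percent: float, period_days: int = 30) -> float:
    """Downtime budget implied by the target, for comparison with what you measured."""
    return period_days * 24 * 60 * (1 - sla_target_percent / 100.0)


def sla_breached(downtime_minutes: float, sla_target_percent: float, period_days: int = 30) -> bool:
    """True when measured availability falls below the contracted target."""
    return availability_percent(downtime_minutes, period_days) < sla_target_percent
```

For example, a 99.9% target over 30 days leaves a budget of roughly 43 minutes, so a 90-minute outage would fall below the target under these assumptions while a 30-minute one would not.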
The teams utilizing the vendor should evaluate whether the incident was impactful enough to trigger a vendor change. Weighing the cost of incident(s) against the switching costs and available features should be handled after the incident is concluded, when the team can fully evaluate how the vendor handled the incident from start to finish.
As with any PIR, determine if your actions were effective and make any updates needed to your vendor runbook:
Conclusion
Vendor incidents are stressful, not only because of their potential impact on our organizations, but often because of the helplessness our responders feel when issues are out of their hands. Preparing in advance for vendor issues will help keep your teams informed and make recovery more efficient.